Custom scoring matrices for AT-rich genomes, part 1.

Due Friday, February 22nd, as an emailed sws file and in-class presentation.

Groups:
Group 1: Seth, Sarah, Gregg; organism: Plasmodium falciparum.
Group 2: Annete, Huimin, (Matthew); organism: Dictyostelium discoideum.
Group 3: Lindsey, Feng, Jonathan; organism: Mycoplasma mycoides.
Group 4: Shanshan, Hillary, Kristin; organism: Brugia malayi.

We depart somewhat from our theme of mammalian evolution to study the predicted proteomes of four unusual organisms. All four have extremely A+T-rich genomes, which may result in unusual amino acid distributions in their proteins. The purpose of this lab is to investigate these distributions and use them to construct new scoring matrices adapted to these organisms.
The relevant protein files are online at:

http://www.d.umn.edu/~mhampton/PlasProtsRef.fa for Plasmodium falciparum.
http://www.d.umn.edu/~mhampton/DictyosteliumProts.fa for Dictyostelium discoideum.
http://www.d.umn.edu/~mhampton/MycoplasmaProts.fa for Mycoplasma mycoides.
http://www.d.umn.edu/~mhampton/BrugiaProts.fa for Brugia malayi.

These are in FASTA format, so you can parse them in Sage with code similar to last week's (with the correct url substituted):

import urllib2
online_file = urllib2.urlopen(url)
from Bio import SeqIO
my_parser = SeqIO.parse(online_file,'fasta')

You can extract records from the parser as before (iterating in a for loop).
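
The for-loop pattern can be sketched as follows, here tallying residues as the loop goes. This is only a minimal sketch, not the required solution: the function name amino_acid_distribution and the toy sequences are hypothetical, and the function assumes each item it receives can be turned into a plain string with str().

```python
from collections import Counter

# The 20 standard one-letter amino acid codes.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_distribution(sequences):
    """Return {amino acid: relative frequency} over an iterable of protein sequences."""
    counts = Counter()
    for seq in sequences:
        # Count only the 20 standard letters; X, *, and other symbols are skipped.
        counts.update(ch for ch in str(seq).upper() if ch in STANDARD_AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No standard residues seen: report all-zero frequencies.
        return dict((aa, 0.0) for aa in STANDARD_AMINO_ACIDS)
    return dict((aa, counts[aa] / float(total)) for aa in STANDARD_AMINO_ACIDS)

# Toy sequences standing in for real proteome records (hypothetical data).
toy_sequences = ["MKKNX", "MNNIK*"]
dist = amino_acid_distribution(toy_sequences)
```

With the parser from above, something like amino_acid_distribution(record.seq for record in my_parser) should give the proteome-wide frequencies, since each record's seq attribute converts to a string. Note that the parser can only be iterated over once, so re-create it if you need a second pass.
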
  1. Find the distribution of amino acids in the given proteome. The standard one-letter amino acid codes are: ACEDGFIHKMLNQPSRTWVY. Sometimes X denotes an unknown amino acid; you can ignore any non-standard letters. Compare this distribution to that of the human proteome (at http://www.d.umn.edu/~mhampton/HomoProts.fa) and to the BLOSUM62 distribution, which you can find on page 908 of the paper by Yu and Altschul (PubMed ID 15509610).
    Does the high A+T content of the genome affect the amino acid distribution in proteins?
    What are some ways you could investigate that relationship further?

    One thing that may be helpful in thinking about the relationship between A+T content of the genome and the amino acid composition is the codon-to-amino-acid map and its inverse. Biopython has that somewhat built-in:
    from Bio.Data import CodonTable
    t = CodonTable.standard_dna_table
    Now t.forward_table is a dictionary mapping codons (the keys) to amino acids (the values). It may be useful to have the reverse map as well; one way is to build the following dictionary, whose keys are the amino acid letters and whose values are lists of codons:
    inv_dict = {}
    for key in t.forward_table.keys():
        amino = t.forward_table[key]
        if amino in inv_dict:
            inv_dict[amino].append(key)
        else:
            inv_dict[amino] = [key]
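
One possible way to use the reverse map is to ask how A+T-rich the codons for each amino acid are. The sketch below rebuilds the standard genetic code in plain Python so that it runs without Biopython; the sense-codon mapping should match t.forward_table. The names at_fraction, mean_at, and ranked are hypothetical, and the unweighted average over synonymous codons is a simplifying assumption that ignores each organism's actual codon usage.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Codon -> amino acid for the 61 sense codons (stop codons are left out).
codon_to_amino = dict(
    ("".join(codon), amino)
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
    if amino != "*"
)

# Reverse map: amino acid -> list of its codons.
inv_dict = {}
for codon, amino in codon_to_amino.items():
    inv_dict.setdefault(amino, []).append(codon)

def at_fraction(codon):
    """Fraction of the three bases in a codon that are A or T."""
    return sum(1 for base in codon if base in "AT") / 3.0

# Unweighted mean A+T fraction over each amino acid's synonymous codons.
# Assumption: every synonymous codon is treated as equally likely.
mean_at = dict(
    (amino, sum(at_fraction(c) for c in codons) / len(codons))
    for amino, codons in inv_dict.items()
)

# Amino acids ordered from most to least A+T-rich codons.
ranked = sorted(mean_at, key=mean_at.get, reverse=True)
```

Comparing a ranking like this with the amino acids that are over- or under-represented in your proteome is one way to start on the questions above.
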
    

  2. For the 22nd, prepare to present your results. Provide a brief overview of your group's organism - its life cycle, scientific interest, basic taxonomy, properties of its genome, and any other interesting facts you find. What are possible reasons for the extreme A+T values in the genome? Finally, describe your lab results.