Processing alignments and the Jukes-Cantor measure of evolutionary change.

Due Wednesday, February 13th (1 per group).

Groups:
Group 1: Seth, Huimin, Kristin
Group 2: Annete, Feng, Gregg
Group 3: Lindsey, Matthew, Hillary
Group 4: Shanshan, Sarah, Jonathan

There are two goals for this week's lab:
(1) to better understand the Jukes-Cantor model of sequence evolution by applying it to some actual sequences;
(2) to learn to use Biopython to parse two of the most common file formats in bioinformatics: FASTA and ClustalW.

As always, do not hesitate to ask for help from me or your partners.

Possibly useful webpages: the EBI ClustalW server (for generating new multiple alignments), NCBI Entrez, the Python Tutorial, and the Biopython docs on sequence input/output.

  1. As a warmup, check out an entry in the Nucleotide database, and select "FASTA" at the "Display" option. The FASTA format is very simple: there are description lines starting with ">", and all other lines are assumed to contain letters in the sequence(s).
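
    To make the format concrete, here is a minimal sketch of reading FASTA text by hand, using only plain Python. The function name read_fasta and the two toy records are made up for illustration; in the exercises below Biopython does this work for you.

```python
def read_fasta(lines):
    """Return a list of (description, sequence) pairs from FASTA-formatted lines."""
    records = []
    description = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line ends the previous record, if there was one.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:]
            chunks = []
        else:
            # Every other line holds letters of the current sequence.
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records

# Made-up two-record example, not real data.
example = """>seq1 toy record
ATGCATGC
ATGC
>seq2 another toy record
TTGACA
"""
toy_records = read_fasta(example.splitlines())
for description, sequence in toy_records:
    print("%s (%d bases)" % (description, len(sequence)))
```

    Note that a sequence may be split over several lines; the pieces are simply joined together.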

  2. Compute the Jukes-Cantor distance between each pair of sequences in this file. You can access the records in the file by executing the following in a cell:

    import urllib2
    online_file = urllib2.urlopen('http://www.d.umn.edu/~mhampton/nd4s.fa')
    from Bio import SeqIO
    my_parser = SeqIO.parse(online_file,'fasta')

    The parser my_parser can now access the records in the file. The following loop extracts the descriptions and sequences of each record, and puts each into a list (copy and execute):

    descriptions = []
    sequences = []
    for record in my_parser:
        descriptions.append(record.description)
        # str() gives the sequence as a plain string (older Biopython also had record.seq.tostring())
        sequences.append(str(record.seq))
    online_file.close()


    (The last line closes the connection to the online file since we don't need it anymore.) To check that all is well, try:

    for stuff in descriptions:
        print(stuff)

    These DNA sequences are all the same length and all code for NADH dehydrogenase subunit 4, a mitochondrial protein. For each pair of sequences, compute the fraction of bases that differ, D, and use that to compute the Jukes-Cantor distance d = -3*log(1 - 4D/3)/4 between them (here log is the natural logarithm, and the formula only makes sense when D < 3/4).
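
    As a rough sketch of the computation for a single pair, the following works on two made-up toy strings rather than the ND4 data. The function names fraction_differing and jukes_cantor are illustrative choices, not part of Biopython.

```python
from math import log

def fraction_differing(seq_a, seq_b):
    """Fraction D of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / float(len(seq_a))

def jukes_cantor(D):
    """Jukes-Cantor distance d = -3*log(1 - 4D/3)/4, defined only for D < 3/4."""
    if D >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for D >= 3/4")
    return -3.0 * log(1.0 - 4.0 * D / 3.0) / 4.0

# Toy sequences invented for this example (they differ at 2 of 10 positions).
toy_a = "ACGTACGTAC"
toy_b = "ACGTTCGTAA"
D = fraction_differing(toy_a, toy_b)
d = jukes_cantor(D)
print("D = %.3f, d = %.3f" % (D, d))
```

    The distance d is always at least as large as D, since it tries to account for sites that changed more than once.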

    Do your answers make sense for these species? Why or why not?

    With the assumptions of the Jukes-Cantor model, if the human-chimp split occurred 7 million years ago, when did the human-mouse split occur?
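
    One way to set this up, assuming a constant substitution rate so that the Jukes-Cantor distance grows in proportion to the time since the split, is a simple rescaling. The distances below are hypothetical placeholders, not results; substitute the values you computed.

```python
def scaled_split_time(d_reference, t_reference, d_other):
    """Split time for another pair, assuming distance is proportional to time."""
    return t_reference * d_other / d_reference

# Hypothetical distances for illustration only; use your own computed values.
d_human_chimp = 0.10
d_human_mouse = 0.50
t_human_mouse = scaled_split_time(d_human_chimp, 7.0, d_human_mouse)
print("%.1f million years" % t_human_mouse)
```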

    Programming note: to access the ith base in the jth sequence use the following syntax: sequences[j][i]


  3. Repeat the above exercise for the TRIM5 sequences in this file. This ".aln" file is in ClustalW format, named for a commonly used multiple sequence alignment program. To parse this file, modify a copy of your previous code as follows:

    my_parser = SeqIO.parse(online_file,'clustal')

    i.e. the parse format is 'clustal' instead of 'fasta'.

    The only difference in your analysis is that you should not count mismatches involving gaps: if either character at a position is '-', ignore that position and do not include it in the length of the sequence. For example, the two sequences:

    ATTG-G-CT
    ATTGCGCGT

    would be considered to have 1 mismatch out of 7 bases, so D = 1/7 (NOT 1/9 or 3/9!).
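
    The gap rule can be sketched as follows, using the two example sequences above. The function name fraction_differing_ungapped is a made-up helper, and the sketch assumes the two strings are rows of the same alignment (so they have equal length).

```python
def fraction_differing_ungapped(seq_a, seq_b, gap="-"):
    """Fraction of differing positions, skipping any column that contains a gap."""
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # gapped columns count neither as mismatches nor toward the length
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / float(compared)

D = fraction_differing_ungapped("ATTG-G-CT", "ATTGCGCGT")
print("D = %.4f" % D)  # 1 mismatch in 7 compared columns
```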

  4. Finally, choose ONE of the following additional exercises:
    1. Repeat exercise 2 for a different gene. You do not have to use exactly the same species, but you should include at least: human, chimp, another primate, and the mouse, and have at least 5 species total. You will probably want to use the Clustal server linked to at the top of this page to generate an alignment.
    2. Repeat exercises 1 and 2 using the Kimura model instead of Jukes-Cantor (equation 4.13 on page 63 of our text). How does this affect the results?
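
    For the Kimura option, here is a hedged sketch that assumes the standard Kimura two-parameter distance, d = -(1/2)*log(1 - 2P - Q) - (1/4)*log(1 - 2Q), where P is the fraction of transitions (A<->G, C<->T) and Q the fraction of transversions. Check this against equation 4.13 in the text, which may use different notation or a different variant of the model. The function name kimura_2p and the toy sequences are made up for illustration.

```python
from math import log

PURINES = set("AG")
PYRIMIDINES = set("CT")

def kimura_2p(seq_a, seq_b, gap="-"):
    """Kimura two-parameter distance between two aligned sequences, skipping gaps."""
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            continue
        # Assumption: only A, C, G, T occur; any other mismatch is counted as a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    P = transitions / float(compared)
    Q = transversions / float(compared)
    return -0.5 * log(1.0 - 2.0 * P - Q) - 0.25 * log(1.0 - 2.0 * Q)

# Toy sequences: one transition (A/G) and one transversion (A/T) in 10 positions.
d_kimura = kimura_2p("ACGTACGTAC", "GCGTTCGTAC")
print("Kimura distance = %.4f" % d_kimura)
```

    Because the model treats transitions and transversions separately, sequences with the same D can have different Kimura distances.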