Due Monday, April 14th. Bring a hard copy of your results, and be prepared to share your findings with the class in a brief (10-15 minute) presentation.

Groups:

Everyone should read the following article: Mammalian evolution and biomedicine: new views from phylogeny (PubMed ID: 17624960).
Supplemental optional readings:


You will make a first attempt at a phylogenetic hypothesis by picking a representative protein or DNA sequence and building trees with UPGMA (unweighted pair group method with arithmetic mean) and neighbor-joining.
Note that choosing and aligning the sequences is largely independent of the programming part of this assignment, so these parts can mostly be done by different group members.

Feel free to ask me for help.
  1. Choose a DNA sequence with known protein orthologs in each of the five given species. Do not use a cytochrome oxidase or a ribosomal protein (such as SSU proteins). You want to choose something with a moderate level of variability between the species - something like a histone might not vary enough, whereas an immunoglobulin might vary too much. Start with the least-studied species of the five (you are allowed to substitute closely related species if it helps - for example, it is acceptable to use Rattus norvegicus instead of Mus musculus). Explain your choice. It may be helpful to use NCBI's Taxonomy database, with some of the optional items checked such as showing the number of Nucleotide and Protein sequences available at each taxonomic level.
  2. Get a multiple alignment of the five DNA sequences. The ClustalW server at EMBL-EBI is one recommended option. To get better results for your phylogeny, you may wish to trim some sequences to a well-aligned region - often some sequences are much longer than others, and these long gaps can distort measures of similarity and the quality of the multiple alignment. The well-aligned parts of your sequences should be at least 200-300 nucleotides and probably less than 20000.
  3. Write a program that outputs the neighbor-joined tree in Newick format from a distance matrix. The file here will get you started - it has a program that outputs the UPGMA tree (also available from the published list on the 5233 server). Here is an example DNA .aln file you can use as practice (it has 6 species).
  4. Use these programs to compute the UPGMA and neighbor-joined trees for your data, using Jukes-Cantor or Kimura distances.
  5. Are your results qualitatively and/or quantitatively consistent with the literature? Are there any controversial phylogenies that relate to your theme?
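
As a starting point for step 4, here is a minimal sketch of how the Jukes-Cantor and Kimura two-parameter distances could be computed from a pair of aligned DNA sequences. It is not the course's starter code: the function names (count_differences, jukes_cantor, kimura_2p, distance_matrix) are made up for illustration, and it assumes you have already read the alignment into a dictionary of equal-length strings. Columns containing a gap or an ambiguous base in either sequence are skipped, which is one common convention but not the only one.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions) for two aligned sequences.

    Only columns where both sequences have A, C, G or T are counted;
    gaps and ambiguity codes are skipped (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in BASES or b not in BASES:
            continue
        sites += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return sites, transitions, transversions


def jukes_cantor(seq1, seq2):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - 4p/3)."""
    sites, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / sites
    if p >= 0.75:
        raise ValueError("sequences too divergent for Jukes-Cantor")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1, seq2):
    """Kimura two-parameter distance:
    d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q),
    where P and Q are the transition and transversion proportions.
    """
    sites, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / sites, tv / sites
    a, b = 1 - 2 * P - Q, 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for Kimura distance")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def distance_matrix(seqs, distance=jukes_cantor):
    """Return (names, matrix) for a dict mapping name -> aligned sequence."""
    names = list(seqs)
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(seqs[names[i]], seqs[names[j]])
            matrix[i][j] = matrix[j][i] = d
    return names, matrix
```

For example, calling distance_matrix(seqs, distance=kimura_2p) on your five trimmed sequences would give a symmetric 5 x 5 matrix that could then be passed to your UPGMA and neighbor-joining programs. Both formulas break down when sequences are very divergent, which is one more reason to trim to a well-aligned region first.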