Custom scoring matrices for AT-rich genomes, part 4.

Due Monday, March 10th (before class), as an emailed worksheet.

Groups: This lab will be done individually, but some collaboration is encouraged. You must do part 4a yourself: you may consult with others, but you have to write your own code and cannot simply copy it from someone else.


This lab series is by far the most complicated that we will do. Here is an overview (more details for each step are given later):
  1. Compute amino acid frequencies for the organism
  2. Generate alignments of some highly conserved proteins
    1. Prepare FASTA sequence files: remove the sequences from the other three AT-rich organisms from each sequence file
    2. Create the alignments using ClustalW
    3. Save the alignments on whatever server you are using
  3. Create scoring matrix
    1. Load the appropriate modules from Biopython
    2. Count transitions of amino acids from other organisms to your AT-rich organism
    3. Create a log-odds scoring matrix
    4. Normalize the matrix to have the same entropy as BLOSUM62
    5. Save the matrix
  4. Use the scoring matrix (this week)
    1. Modify the Needleman-Wunsch program to do Smith-Waterman local alignments
    2. Choose two known proteins from your organism from the previous labs
    3. Find orthologs in another species
    4. Use your scoring matrix and BLOSUM62 to generate Smith-Waterman alignments of the two proteins
    5. Compare the resulting scores and alignments
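
As a rough illustration of steps 3.2 through 3.4 above, here is a minimal sketch of turning counts of aligned residue pairs into log-odds scores and computing the relative entropy of the result. It is not the required implementation. The function name log_odds, the pseudocount parameter, and the toy counts are all made up for this example, and the background frequencies are taken from the pair counts themselves as a simplifying assumption (the lab computes them separately in step 1). Scores are in bits; rescaling to match the entropy of BLOSUM62 is not shown.

```python
import math
from collections import Counter


def log_odds(pair_counts, pseudocount=0.0):
    """Return (scores, entropy) for directional residue-pair counts.

    pair_counts maps (a, b) -> number of times residue a (other organism)
    was aligned with residue b (AT-rich organism). Hypothetical helper.
    """
    src = sorted({a for a, _ in pair_counts})
    dst = sorted({b for _, b in pair_counts})
    total = sum(pair_counts.values()) + pseudocount * len(src) * len(dst)

    # Observed joint frequency q_ab of each aligned pair.
    q = {(a, b): (pair_counts.get((a, b), 0) + pseudocount) / total
         for a in src for b in dst}

    # Background frequencies, assumed here to be the marginals of q.
    p_src = {a: sum(q[(a, b)] for b in dst) for a in src}
    p_dst = {b: sum(q[(a, b)] for a in src) for b in dst}

    # Log-odds score in bits: observed frequency over chance expectation.
    # Pairs with q_ab == 0 have no finite score and are skipped.
    scores = {k: math.log2(v / (p_src[k[0]] * p_dst[k[1]]))
              for k, v in q.items() if v > 0}

    # Relative entropy: the expected score over observed pairs.
    entropy = sum(q[k] * s for k, s in scores.items())
    return scores, entropy


# Toy counts for two residues only (invented numbers, not real data).
toy = Counter({("K", "K"): 8, ("K", "R"): 2, ("R", "K"): 2, ("R", "R"): 8})
scores, entropy = log_odds(toy)
```

In the toy data, identical pairs occur more often than chance and so get positive scores, while mismatched pairs get negative scores. With real alignments a nonzero pseudocount is usually needed so that rarely observed pairs still receive a finite score.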

Part 4