Custom scoring matrices for AT-rich genomes, part 2 and 3.

Due Monday, March 3rd (before class), as an emailed worksheet.

Groups:
Group 1: Seth, Sarah, Gregg; organism: Plasmodium falciparum.
Group 2: Annette, Huimin, Matthew; organism: Dictylostelium discoideum.
Group 3: Lindsey, Feng, Jonathan; organism: Mycoplasma mycoides.
Group 4: Shanshan, Hillary, Kristin; organism: Brugia malayi.

This lab series is somewhat complicated, by far the most complicated that we will do. Here is an overview (more details for each step are given later):
  1. Compute amino acid frequencies for the organism (last week's lab)
  2. Generate alignments of some highly conserved proteins
    1. Prepare FASTA sequence files: remove the AT rich genomes of the other three organisms from each sequence file
    2. Create the alignments: use Clustalw to create alignments
    3. Save the alignments on whatever server you are using
  3. Create scoring matrix
    1. Load the appropriate modules from biopython
    2. Count transitions of amino acids from other organisms to your AT-rich organism
    3. Create a log-odds scoring matrix
    4. Normalize the matrix to have the same entropy as BLOSUM62
    5. Save the matrix
  4. Use the scoring matrix (next week or later)
    1. Choose a known protein from your organism
    2. Find an ortholog in another species
    3. Use your scoring matrix and BLOSUM62 to generate Smith-Waterman alignments of the two proteins
    4. Compare the resulting scores and alignments

Part 2


Use a text editor to remove the other groups' organisms from each fasta file, and then use Clustalw to get a multiple alignment for each protein.
FASTA files for all groups: atpsynthBs.fa, elgmt.fa, hsp70.fa, glykin.fa.

The ClustalW alignment will have a url such as "http://www.ebi.ac.uk/cgi-bin/jobresults/clustalw/clustalw-20080220-18181032.aln". To save it locally to your server, you could do something like:

import urllib2
online_file = urllib2.urlopen(
url)
username = 'admin' # change this to your own username
filepath = os.environ['HOME'] + '/sage_notebook/worksheets/' + username +'/'
server_file = file(filepath + 'myfilename.aln','w')
server_file.write(online_file.read())
online_file.close()
server_file.close()

The following fasta files have already been aligned, so it is possible to begin to create the scoring matrix before finishing part 1. They also might help understand what to do with the other files.
Group 1: Plasmodium falciparum files: plasmyosins.fa Alignment: plasmyosins.aln
Group 2: Dictylostelium discoideum: dictymyosins.fa Alignment: dictymyosins.aln
Group 3: Mycoplasma mycoides: mycomitot1b.fa Alignment: mycomitot1b.aln
Group 4: Brugia malayi: brugiamyosins.fa Alignment: brugiamyosins.aln

EBI ClustalW server

Part 3

Load the Clustalw and AlignInfo modules:
from Bio.Align import AlignInfo 
from Bio import Clustalw 
Now you can load a multliple alignment using something like:
align1 = Clustalw.parse_file(filepath + 'myfilename.aln') 

It is now relatively easy to extract information from the multiple alignment. For example, you can get the number of columns in the alignment with the get_alignment_length() function, or extract column number N with get_column(N).
To construct the scoring matrix, first we need to compute the transition matrix. We will do this in a way similar to the BLOSUM matrices, as described in Chapter 4. Our matrix will not be symmetric, however - we are counting changes from the last entry to all the other entries. As an example:
Aij_dict = {}
for i in range(align1.get_alignment_length()): 
    column = align1.get_column(i) 
    if column[-1] != '-': 
        for letter in column[:len(column)-1]: 
            if letter != '-': 
                curkey = (column[-1], letter)
                if Aij_dict.has_key(curkey): 
                    Aij_dict[curkey] += 1 
                else: 
                    Aij_dict[curkey] = 1 

Its a good idea to add pseudocounts - that is, add in at least 1 count for each possible transition:
letterlist = list(union([x[0] for x in Aij_dict.keys()]))
for letter1 in letterlist:
    for letter2 in letterlist:
        curkey = (letter1, letter2) 
        if Aij_dict.has_key(curkey): 
            Aij_dict[curkey] += 1 
        else: 
            Aij_dict[curkey] = 1

Now you should divide the Aij_dict entries by the total number of transitions, to get a dictionary of frequencies (corresponding to the book's q_{ij} matrix). Then create the scores as the log(q_{ij}/p_i q_j), where p_i is the frequency of amino acid i in the AT-rich organism (from last week's lab) and q_i is the BLOSUM62 amino acid frequency (from the Yu and Altschul paper). Finally, scale your matrix so that its entropy is equal to the entropy of the BLOSUM62 matrix. The entropy of a scoring matrix is the sum (over all i and j) of q_{ij} S_{ij} - that is, the sum of all the scores weighted by the transition frequencies.
  • How does your matrix compare to the BLOSUM62 matrix? (What entries are much higher or lower? How do scores between amino acids with different properties (hydrophobicity, size, charge) compare between the two?)
    Here is a way to access the BLOSUM62 matrix in a dictionary format:
    from Bio.SubsMat import MatrixInfo
    b62 = MatrixInfo.blosum62
    for key in b62.keys():
        b62[(key[1],key[0])] = b62[key]