Due Friday March 14th, as an emailed worksheet.
You can work alone or in a group of your choosing of up to 3 people. As usual, do not hesitate to ask me for help, either in the lab or via email or office hours.

This lab is on finding patterns in protein sequences, specifically the regular expression patterns found at the database Prosite. We will use the python module 're', the regular expression module (described in more detail here) .
  1. Go to Prosite and find the entries PS00768 and PS00222. Read the documentation pages, and find the consensus patterns. We will search the human reference proteins for these patterns. To do this, we will need to load the re module:
    import re 
    Next we need to convert the Prosite pattern into a more standard regular expression syntax. The following function will convert most patterns, stripping out the redundant ' - ' strings and converting the escape characters:
    def PStoRE(PrositePattern):
        rePattern = PrositePattern
        rePattern = rePattern.replace(' - ','')
        rePattern = rePattern.replace('x','.')
        rePattern = rePattern.replace('{','[^')
        rePattern = rePattern.replace('}',']')
        rePattern = rePattern.replace('(','{')
        rePattern = rePattern.replace(')','}')
        return rePattern
    So for example if we had the Prosite consensus pattern for PS00198 as a string, we would convert it:
    PS00198 = 'C - x - {P} - C - {C} - x - C - {CP} - x - {C} - C - [PEG]'
    PS00198re = PStoRE(PS00198)
    And now we 'compile' it into a searchable regular expression:
    PS00198compiled = re.compile(PS00198re)
    After loading the human protein file and the Fasta parsing module:
    import urllib2
    homo_prot_file = urllib2.urlopen('http://www.d.umn.edu/~mhampton/HomoProts.fa')
    from Bio import Fasta
    
    We are ready to search for this pattern:
    fasta_parser = Fasta.RecordParser()
    fasta_iterator = Fasta.Iterator(homo_prot_file, parser = fasta_parser)
    for record in fasta_iterator:
        matches = PS00198compiled.findall(record.sequence)
        if len(matches) != 0:
            print record.title
            print matches
    	print ''      # this just makes a newline to seperate records
    This would print the description line of each protein record that had any matches to the regular expression.
    Modify the above example to search for the PS00768 and PS00222 patterns instead. You will need to reset the file and parser each time with
    homo_prot_file = urllib2.urlopen('http://www.d.umn.edu/~mhampton/HomoProts.fa')
    fasta_parser = Fasta.RecordParser()
    fasta_iterator = Fasta.Iterator(homo_prot_file, parser = fasta_parser)
    
    before iterating over the records again.

    Answer the following questions:
    1. Do all the matches seem correct?

    2. If the amino acids were uniformly and independently distributed (i.e. each occuring randomly with a 1/20 chance per position), how many human proteins would you expect to match each pattern?

    3. How much can you remove part of the beginning or end of the patterns before getting lots of false positives?

    4. How do the matches compare with the matches you get using Blastp, restricted to Homo sapiens, with one of the match segments as a query?


  2. Extra credit mystery identification: what is this protein fragment?