Appendices: Probability, File Formats, Complexity notation

These appendices provide some reference material that may be useful in understanding this text. If you are unfamiliar with the probability and statistics presented here, it may be very helpful to look at the text mentioned in the introduction: Modern Statistics for Modern Biology, by Susan Holmes and Wolfgang Huber.

Appendix: Useful probability distributions

Binomial and geometric distributions

The binomial distribution gives the probability of m successes out of n trials in which each success occurs with probability p:

P(m,n) = \frac{n!}{m!(n-m)!} p^m(1-p)^{n-m}.

For example, the chance of getting exactly 5 heads after flipping a fair coin (so p=0.5) ten times is P(5,10) = \dfrac{10!}{5!5!} (0.5)^5 (0.5)^5 \approx 0.246.

The geometric distribution gives the probability of the first occurence of a success in a series of n trials - that is, the probability of n-1 failures followed by a success. As with the binomial distribution, each trial is independent and is successful with probability p:

G(n) = (1-p)^{n-1} p.

The mean of this distribution is \dfrac{1}{p}, and its variance is \dfrac{1 - p}{p^2}.

As an example, if we achieve a success by rolling a 6 with a six-sided die, the expected number of rolls needed will be 6. The probability of needing exactly six rolls is G(6) = (1-\dfrac{1}{6})^5 \dfrac{1}{6} \approx 0.067. (Note that P(1) = \dfrac{1}{6} \approx 0.167, more likely than the probability of the mean itself.)

Poisson and exponential distributions

The Poisson distribution is: $$ P(k;\lambda) = \dfrac{\lambda^k e^{-\lambda}}{k!}. $$

The exponential distribution gives the probability of a time interval x between two events from a Poisson process:

P(x;\lambda) = \left \{ \begin{array}{cc} \lambda e^{-\lambda x} & x \ge 0, \\ 0 & x < 0. \end{array} \right .

Negative binomial distribution

The negative binomial distribution is a generalization of the geometric distribution, in which we wait for m successes to occur in a series of Bernoulli trials, which have success probability p and failure probability 1-p. (Note that this means that the last event is always a success.) Then the probability of this taking n trials is

NB(n;m,p) = \dfrac{(n-1)!}{(m-1)! (n-m)!} p^m (1-p)^{(n-m)}.

This distribution has mean \dfrac{m}{p} and variance \dfrac{m (1-p)}{p^2}. Note that other texts sometimes define this distribution somewhat differently, for example by defining it to be the probability of the number of failures rather than the total number of trials, or by fixing the number of failures instead of successes.

Appendix: File Formats

FASTA and FASTQ

The FASTA format is a very simple file format for sequence information. It was originally developed as a format for a particular software program (also called FASTA), but is now the most standard human-readable format for sequences. Sequences are delimited by description lines which must start with the character >, followed by the sequence with no additional information. Although the description line can be abused to add other metadata, in general this format is not suitable for retaining complex information such as alignment data.

Here are some simple (short) examples of FASTA records of a DNA and a protein sequence. They have line-breaks in the sequence after 60 characters, but this is an arbitrary choice chosen for readability.

>Example - a simple FASTA sequence example
CACACGTAGGGAACCGAAGTTAAAGAGGTGGTAAACAGCGAGTCAACACGAGTTGCGGCG
ACCCGTTGATCTGGTGCCGGGCGGAAAATTCGTTAATGTGCAGTTTCGAACAGATCAAAG

>gi|386828|gb|AAA59172.1| insulin [Homo sapiens]
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

FASTQ extends the FASTA format to allow for quality scores for each letter of sequence (usually nucleotide). It uses four lines per sequence:

  1. A description line which starts with the character @.

  2. The sequence itself.

  3. A line beginning with + followed by additional descriptive information about the sequence

  4. A line containing as many characters as the sequence, which represent quality scores. The quality scores are usually encoded by the ACSII ranking, in other words the order is, from high to low:

    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVW
    XYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~}
    

Here is an example of a single entry of a FASTQ file:

@Example sequence ID
CACACGTAGGGAACCGAAGTTAAAGAGGTGGTAAACAGCGAGTCAACACGAGTTGCGGCG
+
-A9FR0?Y7<AHT67CN=-;38/HSWN;X4/14IK/X2A9TSADCD1K97;.55W3VFL9

SAM and BAM

SAM is an acronym for Sequence Alignment/Map format. It is most suitable for alignments of sequencing reads to a reference which is not too different from the mapped reads. BAM is simply the binary version of this format. The formal specification for the format can be found at https://samtools.github.io/hts-specs/.

Currently Biopython does not have extensive support for the SAM/BAM formats, but there is a python module called pysam which wraps the standard commandline utility, samtools.

Genbank

The Genbank format, more formally the DDBJ/EMBL/GenBank Feature Table format, is quite flexible and used for a variety of more descriptive records. It is most commonly used for primary DNA, RNA, and protein sequence data. The full specification can be found at http://www.insdc.org/files/feature_table.html. The Biopython SeqIO module can read and parse the features of Genbank files.

Here is a short example of an mRNA database entry from NCBI in the Genbank format:

LOCUS       NM_001040071             651 bp    mRNA    linear   PRI 25-AUG-2012
DEFINITION  Homo sapiens seven transmembrane helix receptor (LOC650293), mRNA.
ACCESSION   NM_001040071 XM_939379
VERSION     NM_001040071.1  GI:91206427
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 651)
  AUTHORS   Rose,J.E., Behm,F.M., Drgon,T., Johnson,C. and Uhl,G.R.
  TITLE     Personalized smoking cessation: interactions between nicotine dose,
            dependence and quit-success genotype score
  JOURNAL   Mol. Med. 16 (7-8), 247-253 (2010)
   PUBMED   20379614
  REMARK    GeneRIF: Clinical trial of gene-disease association and
            gene-environment interaction. (HuGE Navigator)
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AB065698.1.
            On Apr 8, 2006 this sequence version replaced gi:88978953.
FEATURES             Location/Qualifiers
     source          1..651
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="4"
                     /map="4p16.1"
     gene            1..651
                     /gene="LOC650293"
                     /note="seven transmembrane helix receptor"
                     /db_xref="GeneID:650293"
     CDS             1..651
                     /gene="LOC650293"
                     /codon_start=1
                     /product="seven transmembrane helix receptor precursor"
                     /protein_id="NP_001035160.1"
                     /db_xref="GI:91206428"
                     /db_xref="GeneID:650293"
                     /translation="MYLVTVLRNLLSILAVSSDSPLHTPMYFFLSNLCWADIGFTSAM
                     VPKMIVDMQSHSRVISHEGCLTQMFFLVLFACIEGMILTVMAYDCFVAICRPLNYPVI
                     VNPHLCVFFILMSFFLSLLDSQLHSWIVLQFTIIKNVEISNFVCDPSQLLKLACSDSV
                     INSIFTYFHSTMFAFLPISAILLSYYKIVTSILRISSSDGKYKAFSTCDSHLAVVC"
     sig_peptide     1..54
                     /gene="LOC650293"
                     /inference="COORDINATES: ab initio prediction:SignalP:4.0"
     exon            1..651
                     /gene="LOC650293"
                     /inference="alignment:Splign:1.39.8"
ORIGIN      
        1 atgtatctgg tcacggtgct gaggaacctc ctcagtatcc tggctgtcag ctctgactcc
       61 cccctccaca cccccatgta cttcttcctc tccaacctgt gctgggctga catcggtttc
      121 acctcggcca tggttcccaa gatgattgtg gacatgcagt cgcatagcag agtcatctct
      181 catgagggct gcctgacaca gatgtttttc ttggtccttt ttgcatgtat agaaggcatg
      241 atcctgactg tgatggccta tgactgcttt gtagccatct gtcgccctct gaattaccca
      301 gtcatcgtga atcctcacct ctgtgtcttc ttcattttga tgtccttttt ccttagcctg
      361 ttggattccc agctgcacag ttggattgtg ttacaattca caatcatcaa gaatgtggaa
      421 atctctaatt ttgtctgtga cccctctcaa cttctcaaac ttgcctgttc tgacagcgtc
      481 atcaatagca tattcacata tttccatagt actatgtttg cttttcttcc catttcagca
      541 atccttttat cttactataa aatcgtcacc tccattctca ggatttcatc ttcagatggg
      601 aagtataaag ccttctccac ctgtgactct cacctagcag ttgtttgctg a
//

Alignment formats: Clustal and Phylip

The Clustal series of programs introduced one of the most common alignment file formats; there are some variations in this format, such as whether or not the alignment site counts are included or not. Here is an example (using sequences of insulin precursors from a fish, a frog, a bird, and a mammal) of the numbered Clustal Omega format:

CLUSTAL O(1.2.4) multiple sequence alignment


NP_571131.1         MAVWLQAGALLVLLVVSSVSTN-PGTPQHLCGSHLVDALYLVCGPTGFFYNPKRDVEPLL    59
XP_002920166.1      MALWTRLLPLLALLAVWAPVPARTFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVED    60
NP_001079350.1      MALWMQCLPLVLVLLFSTPNTE-ALANQHLCGSHLVEALYLVCGDRGFFYYPKIKRDIEQ    59
NP_990553.1         MALWIRSLPLLALLVFSGPGTSYAAANQHLCGSHLVEALYLVCGERGFFYSPKARRDVEQ    60
                    **:* :   *: :* .         . *********:*******  **** **   :

NP_571131.1         GFLPPKSAQETEVADFAFKDHAELIRKRGIVEQCCHKPCSIFELQNYCN   108
XP_002920166.1      LPAGDAELDRVPGADPQPRALAGALQRRGIVEQCCTSICSLYQLENYCN   109
NP_001079350.1      AQVNGPQDNELDGMQFQPQE--YQKMKRGIVEQCCHSTCSLFQLENYCN   106
NP_990553.1         PLVSSPLRGEAGVLPFQQEE--YEKVKRGIVEQCCHNTCSLYQLENYCN   107
                             .        .       :******** . **:::*:****

The Phylip phylogenetic program suite created by Joseph Felsenstein uses its own format, which has the somewhat annoying requirement of only allowing 10 characters for each sequence name. Here is the same alignment of insulin sequences in Phylip format:

 4 109
NP_571131. MAVWLQAGAL LVLLVVSSVS TN-PGTPQHL CGSHLVDALY LVCGPTGFFY
XP_0029201 MALWTRLLPL LALLAVWAPV PARTFVNQHL CGSHLVEALY LVCGERGFFY
NP_0010793 MALWMQCLPL VLVLLFSTPN TE-ALANQHL CGSHLVEALY LVCGDRGFFY
NP_990553. MALWIRSLPL LALLVFSGPG TSYAAANQHL CGSHLVEALY LVCGERGFFY

           NPKRDVEPLL GFLPPKSAQE TEVADFAFKD HAELIRKRGI VEQCCHKPCS
           TPKARREVED LPAGDAELDR VPGADPQPRA LAGALQRRGI VEQCCTSICS
           YPKIKRDIEQ AQVNGPQDNE LDGMQFQPQE --YQKMKRGI VEQCCHSTCS
           SPKARRDVEQ PLVSSPLRGE AGVLPFQQEE --YEKVKRGI VEQCCHNTCS

           IFELQNYCN
           LYQLENYCN
           LFQLENYCN
           LYQLENYCN

Phylogeny formats: Newick, Nexus, NeXML, PhyloXML

The Nexus and Newick formats are used to describe trees (in the graph theory sense of acyclic connected graphs).

The Newick format was created in 1986 by many of the pioneers of molecular phylogenetics; it is named after the restaurant (Newick's) in Dover, New Hampshire, where they met.

The Nexus format was officially described in 1997 although it had been under development for a decade beforehand. It is a much broader and extendable data format than Newick - internally trees are stored in Newick format, along with many other possible types of data and metadata such as sequence alignments . Newick and Nexus formats are analogous to FASTA and GenBank formats with corresponding pros and cons of size, complexity, and flexibility.

Although the Nexus format can contain a variety of information, its format made some types of additions awkward. The more flexible and grammatically precise, but also somewhat more cumbersome format of PhyloXMLwas developed to address these shortcomings. Because it is a well-defined XML language format it can take advantage of many existing XML parsing and processing programming libraries. Similarly, NeXML is another XML-based extension developed around the same time as PhyloXML.

Appendix: Complexity Notation (big O)

There is an extensive notation for efficiently describing the behavior of functions, which is often used to characterize the complexity of algorithms. It is sometimes called "Big-O" notation, after the most commonly used piece of it, but there is quite a wide variety of refinements (see the Wikipedia article for a more extensive description).

A function f(x) is O(g(x)) if it can eventually (for large values of x) be bounded by a fixed multiple of g(x), i.e. if there exists a x_0 such that

|f(x)| \le M g(x)

for all x > x_0.

Example of big-O notation

For example, the function f(x) = 2x^3 + x + 1 is O(x^3), where in the definition we could choose any M > 2.