Probalign is a sequence alignment tool that calculates a maximum
expected accuracy alignment using partition function posterior probabilities. Base pair probabilities are estimated using an estimate similar to
Boltzmann distribution
In statistical mechanics and mathematics, a Boltzmann distribution (also called Gibbs distribution Translated by J.B. Sykes and M.J. Kearsley. See section 28) is a probability distribution or probability measure that gives the probability ...
. The partition function is calculated using a
dynamic programming
Dynamic programming is both a mathematical optimization method and a computer programming method. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics.
I ...
approach.
Algorithm
The following describes the algorithm used by probalign to determine the base pair probabilities.
Lecture "Bioinformatics II" at University of Freiburg
/ref>
Alignment score
To score an alignment of two sequences two things are needed:
* a similarity function (e.g. PAM, BLOSUM
In bioinformatics, the BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based o ...
,...)
* affine gap penalty:
The score of an alignment a is defined as:
Now the boltzmann weighted score of an alignment a is:
Where is a scaling factor.
The probability of an alignment assuming boltzmann distribution is given by
Where is the partition function, i.e. the sum of the boltzmann weights of all alignments.
Dynamic Programming
Let denote the partition function of the prefixes and . Three different cases are considered:
# the partition function of all alignments of the two prefixes that end in a match.
# the partition function of all alignments of the two prefixes that end in an insertion .
# the partition function of all alignments of the two prefixes that end in a deletion .
Then we have:
Initialization
The matrixes are initialized as follows:
*
*
*
*
Recursion
The partition function for the alignments of two sequences and is given by , which can be recursively computed:
*
*
* analogously
Base pair probability
Finally the probability that positions and form a base pair is given by:
are the respective values for the recalculated with inversed base pair strings.
See also
* ProbCons ProbCons is an open source probabilistic consistency-based multiple alignment of amino acid sequences. It is one of the most efficient protein multiple sequence alignment programs, since it has repeatedly demonstrated a statistically significant adv ...
* Multiple Sequence Alignment
Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolution ...
References
{{Reflist
External links
Probalign Webservice
Sequence alignment algorithms