Background
Homology is the relationship between biological structures or sequences derived from a common ancestor. Homologous proteins (proteins that share common ancestry) are inferred from their sequence similarity. Inferring homologous relationships involves summing the scores of aligned pairs and subtracting penalties for gaps. Aligning pairs of proteins identifies regions of similarity that indicate a relationship between two or more proteins. For a relationship to be considered homologous, the sum of scores over all aligned pairs of amino acids or nucleotides must be sufficiently high. Standard methods of sequence comparison use a substitution matrix.
Performance
Compared with regular BLAST and PSI-BLAST, CS-BLAST greatly improves alignment quality over the entire range of sequence identities, and especially for difficult alignments. PSI-BLAST (Position-Specific Iterated BLAST) runs at about the same speed per iteration as regular BLAST, but is able to detect weaker sequence similarities that are still biologically relevant. Alignment quality is based on alignment sensitivity and alignment precision.
Alignment Quality
Alignment sensitivity is the fraction of structurally alignable residue pairs that the predicted alignment gets right:
sensitivity = (pairs correctly aligned) / (pairs structurally alignable)
Alignment precision is the fraction of aligned residue pairs that are correct:
precision = (pairs correctly aligned) / (pairs aligned)
Search Performance
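The two alignment-quality fractions defined above (sensitivity and precision) can be made concrete with a small sketch. It assumes that an alignment is represented as a set of residue index pairs and that a reference set of structurally alignable pairs is available. The function name `alignment_quality` and the toy data are hypothetical and are not taken from the CS-BLAST software.

```python
def alignment_quality(predicted_pairs, reference_pairs):
    """Return (sensitivity, precision) for a predicted alignment.

    predicted_pairs: (i, j) residue index pairs aligned by the method.
    reference_pairs: (i, j) pairs that are structurally alignable
                     (assumed gold standard, e.g. from a structure alignment).
    """
    predicted = set(predicted_pairs)
    reference = set(reference_pairs)
    correct = predicted & reference  # pairs correctly aligned

    # Guard against empty sets so the fractions are always defined.
    sensitivity = len(correct) / len(reference) if reference else 0.0
    precision = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, precision


# Toy example: 3 of the 4 predicted pairs are in the 5-pair reference.
predicted = {(1, 1), (2, 2), (3, 4), (5, 6)}
reference = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 6)}
sens, prec = alignment_quality(predicted, reference)
print(sens, prec)  # prints "0.6 0.75"
```

Sensitivity penalizes alignable pairs that were missed, while precision penalizes aligned pairs that are wrong, so an alignment that is short but accurate scores high on precision and low on sensitivity.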
Biegert and Söding evaluated homology detection with a benchmark (graph not shown) that compares CS-BLAST to BLAST, counting pairs from the same superfamily as true positives and pairs from different folds as false positives. A second benchmark (graph not shown, plotted on a different scale) compares the true positives and false positives of PSI-BLAST and CSI-BLAST for one to five iterations.
CS-BLAST offers improved sensitivity and alignment quality in sequence comparison. Sequence searches with CS-BLAST are more than twice as sensitive as BLAST. It produces higher quality alignments and generates reliable E-values without a loss of speed. CS-BLAST detects 139% more homologous proteins than BLAST at a cumulative error rate of 20%. At a 10% error rate, 138% more homologs are detected, and for the easiest cases, at a 1% error rate, CS-BLAST is still 96% more effective than BLAST. Additionally, two iterations of CSI-BLAST are more sensitive than five iterations of PSI-BLAST, detecting about 15% more homologs.
Method
CS-BLAST derives context-specific amino acid similarities from 13-residue windows centered on each residue. It works by generating a sequence profile for a query sequence using context-specific mutations and then jumpstarting a profile-to-sequence search method.
CS-BLAST starts by predicting the expected mutation probabilities for each position. For a given residue, a window of its surrounding residues is selected (13 residues in total, as stated above), as seen in the image. Biegert and Söding then compared the sequence window to a library of thousands of context profiles. The library is generated by clustering a representative set of sequence profile windows. The mutation probabilities are predicted by a weighted mixing of the central columns of the most similar context profiles. This amounts to aligning short, ungapped, nonhomologous profiles to the query window, with better-matching profiles receiving higher weight, which makes them easier to detect. A sequence profile represents a multiple alignment of homologous sequences and describes which amino acids are likely to occur at each position in related sequences.
With this method, substitution matrices are unnecessary. In addition, there is no need for transition probabilities, because context information is encoded within the context profiles. This simplifies computation and allows runtime to scale linearly instead of quadratically.
The context-specific mutation probability, that is, the probability of observing a specific amino acid in a homologous sequence given a context, is calculated by a weighted mixing of the amino acids in the central columns of the most similar context profiles. The image illustrates the calculation of expected mutation probabilities for a specific residue at a certain position. As seen in the image, every profile in the library contributes to the context-specific sequence profile for the query sequence in proportion to its similarity.
Models
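The weighted mixing of central columns described in the Method section can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the library is random toy data, similarity is taken to be the ungapped likelihood of the window under each profile, and all window positions and profiles are treated uniformly. The actual CS-BLAST implementation may weight positions and profiles differently, and the names `random_profile` and `mutation_probabilities` are hypothetical.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
WINDOW = 13                        # window length stated in the text
CENTER = WINDOW // 2               # index of the central column


def random_profile(rng):
    """Toy context profile: WINDOW columns, each a distribution over ALPHABET."""
    profile = []
    for _ in range(WINDOW):
        raw = [rng.random() + 0.01 for _ in ALPHABET]
        total = sum(raw)
        profile.append([v / total for v in raw])
    return profile


def mutation_probabilities(window, library):
    """Mix the central columns of all profiles, weighted by similarity."""
    assert len(window) == WINDOW
    # Ungapped, column-by-column log-likelihood of the window per profile.
    log_scores = [
        sum(math.log(profile[j][ALPHABET.index(res)])
            for j, res in enumerate(window))
        for profile in library
    ]
    # Normalize the scores into weights that sum to 1 (stable softmax).
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    z = sum(unnormalized)
    weights = [u / z for u in unnormalized]
    # Weighted mixture of the central columns.
    return [
        sum(w * profile[CENTER][a] for w, profile in zip(weights, library))
        for a in range(len(ALPHABET))
    ]


rng = random.Random(0)
library = [random_profile(rng) for _ in range(50)]  # toy library
probs = mutation_probabilities("MKTAYIAKQRQIS", library)
print(round(sum(probs), 6))  # the mixture is itself a distribution
```

For a fixed library, each query position costs the same amount of work (library size times window length), which is consistent with the linear scaling mentioned in the text.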
Predicting substitution probabilities from only the local sequence context of an amino acid has the advantage that the structure of the query protein need not be known, while still allowing more homologous proteins to be detected than with standard substitution matrices. Biegert and Söding's approach to predicting substitution probabilities was based on a generative model. In another paper, in collaboration with Angermüller, they developed a discriminative machine learning method that improves prediction accuracy.
Generative Model
Given an observed variable $x$ and a target variable $y$, a generative model defines the probabilities $P(x \mid y)$ and $P(y)$ separately. To predict the unobserved target variable $y$, Bayes' theorem is used: $P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$. A generative model, as the name suggests, allows one to generate new data points $(x, y)$. The joint distribution is described as $P(x, y) = P(x \mid y)\,P(y)$. A generative model is trained by choosing the parameters $\theta$ that maximize the joint probability: $\hat{\theta} = \arg\max_{\theta} P(x, y \mid \theta)$.
Discriminative Model
The discriminative model is a logistic regression (maximum entropy) classifier. Its goal is to predict a context-specific substitution probability given a query sequence. The discriminative approach models the substitution probabilities $P(a \mid X_i)$, where $X_i$ describes a sequence of $l$ amino acids around position $i$ of a sequence, and is based on context states. Each context state $k$ is characterized by an emission weight $v_k(a)$, a bias weight $w_k$, and a context weight $\lambda_k(j, a)$. Emission probabilities from a context state are given by the emission weights as follows, for $a = 1$ to $20$:

$$p(a \mid k) = \frac{\exp(v_k(a))}{\sum_{a'=1}^{20} \exp(v_k(a'))}$$

where $p(a \mid k)$ is the emission probability and $k$ is the context state. In the discriminative approach, the probability of a context state given the context $X_i$ is modeled directly by the exponential of an affine function of the context count profile $c_i(j, a)$, where the normalization constant $Z(X_i)$ makes the probabilities sum to 1. In the following equation, the first summation runs over $j = 1$ to $l$ and the second over $a = 1$ to $20$:

$$p(k \mid X_i) = \frac{1}{Z(X_i)} \exp\left( w_k + \sum_{j=1}^{l} \sum_{a=1}^{20} \lambda_k(j, a)\, c_i(j, a) \right)$$

As with the generative model, the target distribution is obtained by mixing the emission probabilities of each context state, weighted by the similarity.
Using CS-BLAST
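Before turning to the web service, the discriminative model described above can be sketched in a few lines. The sketch assumes that each context state carries a bias weight, a table of context weights with the same shape as the context count profile, and 20 emission weights, as the text describes. The dictionary layout, the function names, the one-hot toy profile and the random weights are all hypothetical, and none of it is taken from the authors' code.

```python
import math
import random

N_AA = 20  # number of amino acids


def softmax(values):
    """Exponentiate and normalize so the results sum to 1."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def substitution_probabilities(counts, states):
    """Mix per-state emission probabilities, weighted by state probability.

    counts: context count profile, one list of N_AA numbers per window column.
    states: dicts with 'bias' (float), 'context' (same shape as counts)
            and 'emission' (N_AA floats); an assumed layout.
    """
    # Affine score per state: bias plus context weights times counts,
    # summed over window columns and amino acids.
    scores = [
        s["bias"] + sum(s["context"][j][a] * counts[j][a]
                        for j in range(len(counts)) for a in range(N_AA))
        for s in states
    ]
    state_probs = softmax(scores)  # softmax supplies the normalization
    emissions = [softmax(s["emission"]) for s in states]
    return [
        sum(pk * em[a] for pk, em in zip(state_probs, emissions))
        for a in range(N_AA)
    ]


rng = random.Random(1)
length = 3  # toy window length, shorter than a real context window
counts = [[0.0] * N_AA for _ in range(length)]
for j, a in enumerate([0, 5, 12]):  # one-hot counts for a toy window
    counts[j][a] = 1.0
states = [
    {
        "bias": rng.uniform(-1, 1),
        "context": [[rng.uniform(-1, 1) for _ in range(N_AA)]
                    for _ in range(length)],
        "emission": [rng.uniform(-1, 1) for _ in range(N_AA)],
    }
    for _ in range(4)
]
probs = substitution_probabilities(counts, states)
print(round(sum(probs), 6))  # the mixture is a probability distribution
```

Because both the state probabilities and the emission probabilities are normalized, their mixture is a valid distribution over the 20 amino acids without any further rescaling.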
The MPI Bioinformatics Toolkit is an interactive website and service that allows anyone to perform comprehensive and collaborative protein analysis with a variety of tools, including CS-BLAST and PSI-BLAST. The toolkit accepts a protein as input and offers options for customizing the analysis. It can also forward the output to other tools.
See also
* Sequence alignment software
References
Alva, Vikram, Seung-Zin Nam, Johannes Söding, and Andrei N. Lupas. “The MPI Bioinformatics Toolkit as an Integrative Platform for Advanced Protein Sequence and Structure Analysis.” ''Nucleic Acids Research'' 44.Web Server Issue (2016): W410-415. ''NCBI''. Web. 2 Nov. 2016.
Angermüller, Christof, Andreas Biegert, and Johannes Söding. “Discriminative Modelling of Context-specific Amino Acid Substitution Probabilities.” ''Bioinformatics'' 28.24 (2012): 3240-247. ''Oxford Journals''. Web. 2 Nov. 2016.
Altschul, Stephen F., et al. “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs.” ''Nucleic Acids Research'' 25.17 (1997): 3389-402. ''Oxford University Press''. Print.
Biegert, A., and J. Söding. “Sequence Context-specific Profiles for Homology Searching.” ''Proceedings of the National Academy of Sciences'' 106.10 (2009): 3770-3775. ''PNAS''. Web. 23 Oct. 2016.
External links