CS-BLAST (Context-Specific BLAST) is a tool for searching protein sequence databases that extends BLAST (Basic Local Alignment Search Tool) by using context-specific mutation probabilities. More specifically, CS-BLAST derives context-specific amino-acid similarities for each query sequence from short windows of the query sequence. Using CS-BLAST doubles sensitivity and significantly improves alignment quality without a loss of speed in comparison to BLAST. CSI-BLAST (Context-Specific Iterated BLAST) is the context-specific analog of PSI-BLAST (Position-Specific Iterated BLAST), which computes the mutation profile with substitution probabilities and mixes it with the query profile. Both programs are available as web servers and as free downloads.

Background

Homology is the relationship between biological structures or sequences derived from a common ancestor. Homologous proteins (proteins that share common ancestry) are inferred from their sequence similarity. Inferring homologous relationships involves calculating the scores of aligned pairs minus penalties for gaps. Aligning pairs of proteins identifies regions of similarity that indicate a relationship between two or more proteins. To infer a homologous relationship, the sum of scores over all aligned pairs of amino acids or nucleotides must be sufficiently high. Standard methods of sequence comparison use a substitution matrix to accomplish this. Similarities between amino acids or nucleotides are quantified in these substitution matrices. The substitution score S of amino acids a and b can be written as follows:

S(a,b) = \text{const} \times \log \left( \frac{P(a,b)}{P(a)\,P(b)} \right)

where P(a,b) denotes the probability of amino acid a mutating into amino acid b, and P(a) and P(b) denote the overall probabilities of a and b. In a large set of sequence alignments, counting the number of amino acids as well as the number of aligned pairs (a,b) allows the probabilities P(a,b) and P(a) to be derived.

Since protein sequences need to maintain a stable structure, a residue's substitution probabilities are largely determined by the structural context in which it is found. As a result, substitution matrices have been trained for specific structural contexts. Since context information is encoded in transition probabilities between states, mixing mutation probabilities from substitution matrices weighted for the corresponding states achieves better alignment quality than standard substitution matrices. CS-BLAST improves further upon this concept.

The figure illustrates the equivalence of sequence-to-sequence and profile-to-sequence comparison with the alignment matrix. The query profile results from artificial mutations, and its bar heights are proportional to the corresponding amino acid probabilities.

[Figure not shown. Caption: "Sequence search/alignment algorithms find the path that maximizes the sum of similarity scores (color-coded blue to red). Substitution matrix scores are equivalent to profile scores if the sequence profile (colored histogram) is generated from the query sequence by adding artificial mutations with the substitution matrix pseudocount scheme. Histogram bar heights represent the fraction of amino acids in profile columns."]
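The counting procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used to build any published matrix: the function name `substitution_scores`, the base-2 logarithm, the default constant of 2 and the toy alignment data are all assumptions made for the example.

```python
import math
from collections import Counter

def substitution_scores(aligned_pairs, const=2.0):
    """Estimate S(a,b) = const * log2(P(a,b) / (P(a) P(b))) from aligned pairs.

    `const` and the base-2 logarithm are assumptions chosen for illustration.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in aligned_pairs:
        # Count each pair in both orientations so that S(a,b) == S(b,a).
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
        residue_counts[a] += 1
        residue_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        p_ab = n / total_pairs                        # P(a,b)
        p_a = residue_counts[a] / total_residues      # P(a)
        p_b = residue_counts[b] / total_residues      # P(b)
        scores[(a, b)] = const * math.log2(p_ab / (p_a * p_b))
    return scores

# Hypothetical toy data: leucine and lysine mostly align with themselves.
toy_pairs = [("L", "L")] * 4 + [("K", "K")] * 4 + [("L", "K")]
toy_scores = substitution_scores(toy_pairs)
```

In this toy data set, pairs that are aligned more often than expected by chance (such as L with L) receive a positive score, and pairs aligned less often than expected (L with K) receive a negative score.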

Performance

CS-BLAST greatly improves alignment quality over the entire range of sequence identities and especially for difficult alignments in comparison to regular BLAST and PSI-BLAST. PSI-BLAST (Position-Specific Iterated BLAST) runs at about the same speed per iteration as regular BLAST, but is able to detect weaker sequence similarities that are still biologically relevant. Alignment quality is based on alignment sensitivity and alignment precision.

Alignment Quality

Alignment sensitivity is the fraction of structurally alignable residue pairs that the predicted alignment aligns correctly: (pairs correctly aligned)/(pairs structurally alignable). Alignment precision is the fraction of aligned residue pairs that are correct: (pairs correctly aligned)/(pairs aligned).
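As a minimal sketch of these two fractions, assume that an alignment is represented as a set of (query position, target position) pairs and that a structure-based reference alignment is available. The function name `alignment_quality` and the example pairs are hypothetical.

```python
def alignment_quality(predicted_pairs, reference_pairs):
    """Return (sensitivity, precision) of a predicted alignment.

    Both arguments are iterables of (query_position, target_position) pairs;
    the reference is assumed to hold the structurally alignable pairs.
    """
    predicted = set(predicted_pairs)
    reference = set(reference_pairs)
    correct = predicted & reference  # pairs correctly aligned
    sensitivity = len(correct) / len(reference) if reference else 0.0
    precision = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, precision

# Hypothetical example: 2 of 4 predicted pairs match the 5 reference pairs.
predicted = [(1, 1), (2, 2), (3, 4), (4, 5)]
reference = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
sensitivity, precision = alignment_quality(predicted, reference)
# sensitivity == 0.4, precision == 0.5
```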

Search Performance

Biegert and Söding evaluated homology detection with a benchmark that compares CS-BLAST to BLAST, plotting true positives (pairs from the same superfamily) against false positives (pairs from different folds).

[Graph not shown: true positives versus false positives for CS-BLAST and BLAST.]

A second graph, drawn to a different scale, shows the true positives and false positives detected by PSI-BLAST and CSI-BLAST for one to five iterations.

[Graph not shown: true positives versus false positives for PSI-BLAST and CSI-BLAST, one to five iterations.]

CS-BLAST offers improved sensitivity and alignment quality in sequence comparison. Sequence searches with CS-BLAST are more than twice as sensitive as searches with BLAST. CS-BLAST produces higher-quality alignments and generates reliable E-values without a loss of speed. At a cumulative error rate of 20%, CS-BLAST detects 139% more homologous proteins than BLAST. At a 10% error rate, 138% more homologs are detected, and for the easiest cases, at a 1% error rate, CS-BLAST is still 96% more effective than BLAST. Additionally, two iterations of CSI-BLAST are more sensitive than five iterations of PSI-BLAST, detecting about 15% more homologs.

Method

The CS-BLAST method derives context-specific amino-acid similarities from 13-residue windows centered on each residue. CS-BLAST works by generating a sequence profile for a query sequence using context-specific mutations and then jump-starting a profile-to-sequence search method.

CS-BLAST starts by predicting the expected mutation probabilities for each position. For a given residue, a sequence window comprising the residue and its surrounding residues is selected, as seen in the image. Biegert and Söding then compare this sequence window to a library of thousands of context profiles. The library is generated by clustering a representative set of sequence profile windows. The mutation probabilities are predicted by a weighted mixing of the central columns of the most similar context profiles. The comparison aligns short, nonhomologous profiles without gaps and gives higher weight to better-matching profiles, making them easier to detect.

A sequence profile represents a multiple alignment of homologous sequences and describes which amino acids are likely to occur at each position in related sequences. With this method, substitution matrices are unnecessary. In addition, transition probabilities are not needed because context information is encoded within the context profiles. This simplifies computation and allows runtime to scale linearly instead of quadratically.

The context-specific mutation probability, that is, the probability of observing a specific amino acid in a homologous sequence given a context, is calculated by a weighted mixing of the amino acids in the central columns of the most similar context profiles. The image illustrates the calculation of expected mutation probabilities for a specific residue at a certain position. As seen in the image, every profile in the library of context profiles contributes, according to its similarity, to the context-specific sequence profile for the query sequence.
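The weighted mixing of central columns can be illustrated with a toy sketch. Everything concrete here is an assumption made for the example: a two-letter alphabet, a 3-residue window instead of 13, two hand-written context profiles, uniform position weights, and a similarity measure taken as the prior times the product of the profile probabilities of the window residues. The actual CS-BLAST implementation and its trained library are not reproduced.

```python
import math

def context_specific_probs(window, profiles, priors, position_weights):
    """Mix the central columns of context profiles, weighted by their
    similarity to the sequence window.

    window: string of odd length.
    profiles: list of profiles; each profile is a list with one dict
        (residue -> probability, all > 0) per window column.
    priors: prior probability of each profile.
    position_weights: one weight per window column.
    """
    center = len(window) // 2
    log_weights = []
    for profile, prior in zip(profiles, priors):
        log_w = math.log(prior)
        for j, residue in enumerate(window):
            log_w += position_weights[j] * math.log(profile[j][residue])
        log_weights.append(log_w)
    # Normalize in log space for numerical stability.
    max_log = max(log_weights)
    weights = [math.exp(lw - max_log) for lw in log_weights]
    total = sum(weights)
    weights = [w / total for w in weights]
    alphabet = profiles[0][center].keys()
    return {
        a: sum(w * p[center][a] for w, p in zip(weights, profiles))
        for a in alphabet
    }

# Hypothetical library of two context profiles over the alphabet {L, K}.
hydrophobic = [{"L": 0.8, "K": 0.2}] * 3
charged = [{"L": 0.2, "K": 0.8}] * 3
probs = context_specific_probs(
    "KLK", [hydrophobic, charged], priors=[0.5, 0.5],
    position_weights=[1.0, 1.0, 1.0],
)
```

In this toy case the central leucine sits between two lysines, so the charged profile receives the larger weight, and the mixed central column predicts that lysine is more likely than leucine at this position in homologous sequences.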

Models

Predicting substitution probabilities from only the local sequence context of an amino acid has the advantage that the structure of the query protein does not need to be known, while still allowing more homologous proteins to be detected than with standard substitution matrices. Biegert and Söding's approach to predicting substitution probabilities was based on a generative model. In a later paper, written in collaboration with Angermüller, they developed a discriminative machine learning method that improves prediction accuracy.

Generative Model

Given an observed variable x and a target variable y, a generative model defines the probabilities P(x | y) and P(y) separately. To predict the unobserved target variable y, Bayes' theorem,

P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}

is used. A generative model, as the name suggests, allows one to generate new data points (x, y). The joint distribution is described as

P(x, y) = P(x \mid y)\,P(y)

To train a generative model, the joint probability of the N training pairs (x_n, y_n) is maximized:

\prod_{n=1}^{N} P(x_n, y_n) = \prod_{n=1}^{N} P(x_n \mid y_n)\,P(y_n)
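A small numeric sketch may help make the generative setting concrete. The two target classes, the two observable residues and all probabilities below are invented for illustration; the helper names `posterior` and `log_joint` are likewise assumptions.

```python
import math

def posterior(likelihood, prior, x):
    """Bayes' theorem: P(y|x) = P(x|y) P(y) / sum over y' of P(x|y') P(y')."""
    joint = {y: likelihood[y][x] * prior[y] for y in prior}
    evidence = sum(joint.values())  # P(x)
    return {y: p / evidence for y, p in joint.items()}

def log_joint(data, likelihood, prior):
    """Log of the product over n of P(x_n | y_n) P(y_n), the training objective."""
    return sum(math.log(likelihood[y][x] * prior[y]) for x, y in data)

# Hypothetical model: y is a context label, x is an observed residue.
prior = {"helix": 0.6, "strand": 0.4}
likelihood = {
    "helix": {"A": 0.7, "V": 0.3},
    "strand": {"A": 0.2, "V": 0.8},
}
post = posterior(likelihood, prior, "V")
objective = log_joint([("A", "helix"), ("V", "strand")], likelihood, prior)
```

Here observing V gives P(V) = 0.3 × 0.6 + 0.8 × 0.4 = 0.5, so the posterior is 0.36 for helix and 0.64 for strand. Training would adjust the entries of `prior` and `likelihood` to make `objective` as large as possible on the training pairs.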

Discriminative Model

The discriminative model is a logistic regression maximum entropy classifier. With the discriminative model, the goal is to predict a context-specific substitution probability given a query sequence. The discriminative approach models the substitution probabilities P(a | C_l), where C_l describes the sequence of amino acids around position l of a sequence, using K context states. Each context state k is characterized by an emission weight v_k(a), a bias weight \pi_k, and a context weight \lambda_k(j,a). The emission probabilities of a context state are given by the emission weights as follows, where the sum in the denominator runs over the 20 amino acids a' = 1, ..., 20:

P(a \mid k) = \frac{\exp(v_k(a))}{\sum_{a'=1}^{20} \exp(v_k(a'))}

where P(a | k) is the emission probability and k is the context state. In the discriminative approach, the probability of a context state k given the context C_l is modeled directly by the exponential of an affine function of the context count profile C_l(j,a), and the normalization constant Z(C_l) normalizes the probabilities so that they sum to 1. The equation is as follows, where the first summation runs over j = -d, ..., d and the second over a = 1, ..., 20:

P(k \mid C_l) = \frac{1}{Z(C_l)} \exp\left( \pi_k + \sum_{j=-d}^{d} \sum_{a=1}^{20} \lambda_k(j,a)\,C_l(j,a) \right)

As with the generative model, the target distribution is obtained by mixing the emission probabilities of each context state, weighted by the similarity.
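The prediction step of the discriminative model can be sketched as two softmax operations followed by a mixture. The sketch below is an assumption-laden toy, not the published implementation: it uses two amino acids instead of 20, a single window column (d = 0), two context states, and hand-picked weights. The names `softmax` and `predict_substitution_probs` are hypothetical.

```python
import math

def softmax(values):
    """Exponentiate and normalize so that the results sum to 1."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict_substitution_probs(context_counts, bias, context_weights,
                               emission_weights):
    """Return P(a | C_l) as a mixture over context states.

    context_counts[j][a]     : C_l(j, a), the context count profile
    bias[k]                  : pi_k
    context_weights[k][j][a] : lambda_k(j, a)
    emission_weights[k][a]   : v_k(a)
    """
    activations = []
    for k in range(len(bias)):
        act = bias[k]
        for j, column in enumerate(context_counts):
            for a, count in enumerate(column):
                act += context_weights[k][j][a] * count
        activations.append(act)
    # P(k | C_l); the softmax denominator plays the role of Z(C_l).
    state_probs = softmax(activations)
    # P(a | k) for every context state.
    emissions = [softmax(v) for v in emission_weights]
    n_amino = len(emission_weights[0])
    return [
        sum(state_probs[k] * emissions[k][a] for k in range(len(bias)))
        for a in range(n_amino)
    ]

# Hypothetical toy parameters: 2 amino acids, 1 window column, 2 states.
mixed = predict_substitution_probs(
    context_counts=[[1.0, 0.0]],
    bias=[0.0, 0.0],
    context_weights=[[[1.0, 0.0]], [[0.0, 1.0]]],
    emission_weights=[[2.0, 0.0], [0.0, 2.0]],
)
```

With these toy weights the context favors the first state, which in turn favors emitting the first amino acid, so the first entry of `mixed` is the larger one.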

Using CS-BLAST

The MPI Bioinformatics Toolkit is an interactive website and service that allows anyone to perform comprehensive and collaborative protein analysis with a variety of tools, including CS-BLAST and PSI-BLAST. The toolkit accepts a protein as input and offers options for customizing the analysis. It can also forward the output to other tools.

References

Alva, Vikram, Seung-Zin Nam, Johannes Söding, and Andrei N. Lupas. "The MPI Bioinformatics Toolkit as an Integrative Platform for Advanced Protein Sequence and Structure Analysis." Nucleic Acids Research 44.Web Server Issue (2016): W410-415. NCBI. Web. 2 Nov. 2016.
Angermüller, Christof, Andreas Biegert, and Johannes Söding. "Discriminative Modelling of Context-specific Amino Acid Substitution Properties." Bioinformatics 28.24 (2012): 3240-247. Oxford Journals. Web. 2 Nov. 2016.
Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs." Nucleic Acids Research 25.17 (1997): 3389-402. Oxford University Press. Print.
Biegert, A., and J. Söding. "Sequence Context-specific Profiles for Homology Searching." Proceedings of the National Academy of Sciences 106.10 (2009): 3770-3775. PNAS. Web. 23 Oct. 2016.

External links

CS-BLAST: free server at University of Munich (LMU)
CS-BLAST: free server at Max-Planck Institute in Tuebingen
CS-BLAST source code