Statistical coupling analysis or SCA is a technique used in bioinformatics to measure covariation between pairs of amino acids in a protein multiple sequence alignment (MSA). More specifically, it quantifies how much the amino acid distribution at some position ''i'' changes upon a perturbation of the amino acid distribution at another position ''j''. The resulting statistical coupling energy indicates the degree of evolutionary dependence between the residues, with higher coupling energy corresponding to increased dependence.
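As a minimal sketch of the two ingredients named above, the following Python computes the amino acid distribution of an MSA column and builds a subalignment by perturbing another column. The toy alignment and the helper names `column_distribution` and `perturb` are hypothetical, and modelling the perturbation as "keep only the sequences with a chosen residue at position ''j''" is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical toy alignment: five aligned sequences, four columns each.
msa = ["VIHA", "VIHC", "LHHA", "VIVD", "LMHA"]


def column_distribution(alignment, position):
    """Return the fraction of sequences carrying each residue at a column."""
    counts = Counter(seq[position] for seq in alignment)
    total = len(alignment)
    return {residue: n / total for residue, n in counts.items()}


def perturb(alignment, position, residue):
    """Assumed perturbation: keep only sequences with `residue` at `position`."""
    return [seq for seq in alignment if seq[position] == residue]


# Distribution at column 0 in the full alignment and after fixing column 1 to I.
full = column_distribution(msa, 0)
sub = column_distribution(perturb(msa, 1, "I"), 0)
print("full alignment:", full)
print("subalignment:  ", sub)
```

If the two printed distributions differ, the columns covary in this toy alignment; SCA turns that difference into a single coupling energy.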
Definition of statistical coupling energy
Statistical coupling energy measures how a perturbation of amino acid distribution at one site in an MSA affects the amino acid distribution at another site. For example, consider a multiple sequence alignment with sites (or columns) ''a'' through ''z'', where each site has some distribution of amino acids. At position ''i'', 60% of the sequences have a valine and the remaining 40% have a leucine; at position ''j'' the distribution is 40% isoleucine, 40% histidine and 20% methionine; ''k'' has an average distribution (the 20 amino acids are present at roughly the same frequencies seen in all proteins); and ''l'' has 80% histidine and 20% valine. Since positions ''i'', ''j'' and ''l'' have an amino acid distribution different from the mean distribution observed in all proteins, they are said to have some degree of conservation.
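In statistical coupling analysis, the conservation (<math>\Delta G^{stat}</math>) at each site (''i'') is defined as:
<math>\Delta G_i^{stat} = \sqrt{\sum_x \left(\ln P_i^x\right)^2}.</math>
Here, <math>P_i^x</math> describes the probability of finding amino acid ''x'' at position ''i'', and is defined by a function in binomial form as follows:
<math>P_i^x = \frac{N!}{n_x!\,(N - n_x)!}\, p_x^{n_x} (1 - p_x)^{N - n_x},</math>
where ''N'' is 100, <math>n_x</math> is the percentage of sequences with residue ''x'' (e.g. methionine) at position ''i'', and <math>p_x</math> corresponds to the approximate distribution of amino acid ''x'' in all positions among all sequenced proteins. The summation runs over all 20 amino acids. After <math>\Delta G_i^{stat}</math> is computed, the conservation for position ''i'' in a subalignment produced after a perturbation of amino acid distribution at ''j'' (<math>\Delta G_{i,\delta j}^{stat}</math>) is taken. Statistical coupling energy, denoted <math>\Delta\Delta G_{i,j}^{stat}</math>, is simply the difference between these two values. That is:
<math>\Delta\Delta G_{i,j}^{stat} = \Delta G_{i,\delta j}^{stat} - \Delta G_i^{stat},</math>
or, more commonly,
<math>\Delta\Delta G_{i,j}^{stat} = \sqrt{\sum_x \left(\ln P_{i,\delta j}^x - \ln P_i^x\right)^2}.</math>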
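The conservation measure can be sketched in Python as follows, assuming it is the square root of the summed squared log binomial probabilities over all 20 amino acids, with ''N'' fixed at 100 so that each count is a percentage. The uniform background frequency (1/20 per amino acid) is a simplifying assumption standing in for the observed frequencies across all sequenced proteins, and the function names are hypothetical.

```python
from math import comb, log, sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N = 100  # the text fixes N at 100, so each n_x is a percentage

# Assumption: uniform background; a real analysis would use observed frequencies.
BACKGROUND = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def binomial_probability(n_x, p_x, n_total=N):
    """Probability of seeing residue x exactly n_x times out of n_total."""
    return comb(n_total, n_x) * p_x**n_x * (1 - p_x) ** (n_total - n_x)


def conservation(percentages, background=BACKGROUND):
    """Conservation of one site; residues missing from `percentages` count as 0."""
    total = 0.0
    for aa, p_x in background.items():
        n_x = percentages.get(aa, 0)
        total += log(binomial_probability(n_x, p_x)) ** 2
    return sqrt(total)


# Sites from the running example, expressed as integer percentages.
site_i = {"V": 60, "L": 40}
site_l = {"H": 80, "V": 20}
site_k = {aa: 5 for aa in AMINO_ACIDS}  # "average" site under the uniform assumption

for name, site in [("i", site_i), ("k", site_k), ("l", site_l)]:
    print(name, round(conservation(site), 2))
```

Under these assumptions the average site ''k'' scores lowest and the strongly skewed site ''l'' scores highest, which matches the qualitative ordering described in the text.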
Statistical coupling energy is often systematically calculated between a fixed, perturbed position and all other positions in an MSA. Continuing with the example MSA from the beginning of the section, consider a perturbation at position ''j'' where the amino acid distribution changes from 40% I, 40% H, 20% M to 100% I. If, in a subsequent subalignment, this changes the distribution at ''i'' from 60% V, 40% L to 90% V, 10% L, but does not change the distribution at position ''l'', then there would be some amount of statistical coupling energy between ''i'' and ''j'' but none between ''l'' and ''j''.
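The example can be checked numerically with a short sketch. It assumes the coupling energy is the square root of the summed squared differences between the log binomial probabilities before and after the perturbation, and it again uses a uniform background frequency of 1/20 as a stand-in for real background frequencies. The function names are hypothetical.

```python
from math import comb, log, sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
P_BACKGROUND = 1 / len(AMINO_ACIDS)  # assumption: uniform background frequency


def log_binomial(n_x, p_x=P_BACKGROUND, n_total=100):
    """Natural log of the binomial probability of n_x occurrences out of n_total."""
    return (
        log(comb(n_total, n_x))
        + n_x * log(p_x)
        + (n_total - n_x) * log(1 - p_x)
    )


def coupling_energy(before, after):
    """Coupling energy of one site given its percentages before/after perturbation."""
    total = 0.0
    for aa in AMINO_ACIDS:
        diff = log_binomial(after.get(aa, 0)) - log_binomial(before.get(aa, 0))
        total += diff**2
    return sqrt(total)


# Position i shifts from 60% V / 40% L to 90% V / 10% L when j is fixed to I.
i_before, i_after = {"V": 60, "L": 40}, {"V": 90, "L": 10}
# Position l keeps 80% H / 20% V in the subalignment.
l_before, l_after = {"H": 80, "V": 20}, {"H": 80, "V": 20}

print("coupling i-j:", coupling_energy(i_before, i_after))
print("coupling l-j:", coupling_energy(l_before, l_after))
```

Because the distribution at ''l'' is identical before and after the perturbation, every term in its sum vanishes and the coupling is exactly zero, while the shift at ''i'' gives a strictly positive value.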
Applications
Ranganathan and Lockless originally developed SCA to examine thermodynamic (energetic) coupling of residue pairs in proteins. Using the PDZ domain family, they were able to identify a small network of residues that were energetically coupled to a binding site residue. The network consisted of both residues spatially close to the binding site in the tertiary fold, called contact pairs, and more distant residues that participate in longer-range energetic interactions. Later applications of SCA by the Ranganathan group on the GPCR, serine protease and hemoglobin families also showed energetic coupling in sparse networks of residues that cooperate in allosteric communication.
Statistical coupling analysis has also been used as a basis for computational protein design. In 2005, Socolich et al. used an SCA for the WW domain to create artificial proteins with similar thermodynamic stability and structure to natural WW domains. The fact that 12 out of the 43 designed proteins with the same SCA profile as natural WW domains folded properly provided strong evidence that little information, only coupling information, was required for specifying the protein fold. This support for the SCA hypothesis was made more compelling considering that a) the successfully folded proteins had only 36% average sequence identity to natural WW folds, and b) none of the artificial proteins designed without coupling information folded properly. An accompanying study showed that the artificial WW domains were functionally similar to natural WW domains in ligand binding affinity and specificity.
In ''de novo'' protein structure prediction, it has been shown that, when combined with a simple residue-residue distance metric, SCA-based scoring can fairly accurately distinguish native from non-native protein folds.
See also
Mutual information
External links
* What is a WW domain?
* http://www.pandasthumb.org/archives/2005/10/protein-folding.html Protein folding — a step closer? - A summary of the Ranganathan lab's SCA-based design of artificial yet functional WW domains.