Direct coupling analysis (DCA) is an umbrella term for several methods for analyzing sequence data in computational biology. The common idea of these methods is to use statistical modeling to quantify the strength of the direct relationship between two positions of a biological sequence, excluding effects from other positions. This contrasts with the usual measures of correlation, which can be large even if there is no direct relationship between the positions (hence the name ''direct'' coupling analysis). Such a direct relationship can, for example, be the evolutionary pressure for two positions to maintain mutual compatibility in the biomolecular structure of the sequence, leading to molecular coevolution between the two positions. DCA has been used in the inference of protein residue contacts, RNA structure prediction, the inference of protein-protein interaction networks, the modeling of fitness landscapes, and the identification of functionally relevant residue communities.

Mathematical Model and Inference

Mathematical Model

The basis of DCA is a statistical model for the variability within a set of phylogenetically related biological sequences. When fitted to a multiple sequence alignment (MSA) of sequences of length $N$, the model defines a probability for all possible sequences of the same length. This probability can be interpreted as the probability that the sequence in question belongs to the same class of sequences as the ones in the MSA, for example the class of all protein sequences belonging to a specific protein family. We denote a sequence by $a = (a_1, a_2, \ldots, a_N)$, with the $a_i$ being categorical variables representing the monomers of the sequence (if the sequences are, for example, aligned amino acid sequences of proteins of a protein family, the $a_i$ take as values any of the 20 standard amino acids). The probability of a sequence within a model is then defined as

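\begin{equation}
P(a \mid J,h) = \frac{1}{Z} \exp\left( \sum_{i=1}^{N} \sum_{j=i+1}^{N} J_{ij}(a_i,a_j) + \sum_{i=1}^{N} h_i(a_i) \right),
\end{equation}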

where
* $J,h$ are sets of real numbers representing the parameters of the model (more below)
* $Z$ is a normalization constant (a real number) to ensure $\sum_{a} P(a \mid J,h) = 1$
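The definition can be checked numerically on a toy model. The following is a minimal sketch, assuming symbols are encoded as integers `0..q-1`, couplings are stored in a dictionary keyed by position pairs `(i, j)` with `i < j`, and all parameter values are hypothetical. It enumerates all $q^N$ sequences, which is feasible only for very small $N$.

```python
import itertools
import math

def log_weight(seq, J, h):
    """Unnormalized log-probability: sum_{i<j} J_ij(a_i,a_j) + sum_i h_i(a_i)."""
    n = len(seq)
    total = sum(h[i][seq[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            total += J[(i, j)][seq[i]][seq[j]]
    return total

def potts_probabilities(J, h, q):
    """Return P(a | J,h) for every sequence, plus the normalization constant Z."""
    n = len(h)
    seqs = list(itertools.product(range(q), repeat=n))
    weights = [math.exp(log_weight(s, J, h)) for s in seqs]
    Z = sum(weights)
    return {s: w / Z for s, w in zip(seqs, weights)}, Z

# Toy model: N = 3 positions, q = 2 symbols (hypothetical parameter values).
q = 2
h = [[0.0, 0.5], [0.0, -0.2], [0.1, 0.0]]
J = {
    (0, 1): [[1.0, 0.0], [0.0, 1.0]],
    (0, 2): [[0.0, 0.0], [0.0, 0.0]],
    (1, 2): [[0.5, -0.5], [-0.5, 0.5]],
}
probs, Z = potts_probabilities(J, h, q)
```

Here `probs` maps each of the $2^3 = 8$ sequences to its probability, and the values sum to one by construction.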

The parameters $h_i(a_i)$ depend on one position $i$ and the symbol $a_i$ at this position. They are usually called fields and represent the propensity of a symbol to be found at a certain position. The parameters $J_{ij}(a_i,a_j)$ depend on pairs of positions $i,j$ and the symbols $a_i,a_j$ at these positions. They are usually called couplings and represent an interaction, i.e. a term quantifying how compatible the symbols at both positions are with each other. The model is fully connected, so there are interactions between all pairs of positions. The model can be seen as a generalization of the Ising model, with spins taking not just two values but any value from a given finite alphabet. In fact, when the size of the alphabet is 2, the model reduces to the Ising model. Because it also resembles the Potts model of statistical physics, it is often called a Potts model.

Even knowing the probabilities of all sequences does not determine the parameters $J,h$ uniquely. For example, a simple transformation of the parameters

\begin{equation}
J_{ij}(a,b) \rightarrow J_{ij}(a,b) + R_{ij}
\end{equation}

for any set of real numbers $R_{ij}$ leaves the probabilities the same. The likelihood function is invariant under such transformations as well, so the data cannot be used to fix these degrees of freedom (although a prior on the parameters might do so). A convention often found in the literature is to fix these degrees of freedom such that the Frobenius norm of the coupling matrix

\begin{equation}
F_{ij} = \sqrt{\sum_{a,b} J_{ij}(a,b)^2},
\end{equation}

is minimized (independently for every pair of positions $i$ and $j$).
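The invariance can be illustrated on a two-position model. The sketch below uses hypothetical parameter values and considers only the constant shift $R_{ij}$ written above; the full set of probability-preserving transformations is larger and is not covered here. Within this restricted family, the Frobenius norm is smallest when the shift equals minus the mean of the coupling matrix entries.

```python
import itertools
import math

def pair_probabilities(J01, h, q):
    """Brute-force probabilities for a two-position model (N = 2)."""
    weights = {
        (a, b): math.exp(J01[a][b] + h[0][a] + h[1][b])
        for a, b in itertools.product(range(q), repeat=2)
    }
    Z = sum(weights.values())
    return {s: w / Z for s, w in weights.items()}

def frobenius(Jij):
    """Frobenius norm of one q x q coupling matrix."""
    return math.sqrt(sum(x * x for row in Jij for x in row))

def shift(Jij, r):
    """Apply J_ij(a,b) -> J_ij(a,b) + R_ij with R_ij = r."""
    return [[x + r for x in row] for row in Jij]

# Hypothetical parameter values for illustration.
q = 3
h = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.0]]
J01 = [[1.0, 0.2, 0.0], [0.1, 0.9, 0.3], [0.0, 0.2, 1.3]]

# Shifting by minus the mean minimizes the norm over constant shifts.
mean = sum(x for row in J01 for x in row) / (q * q)
J01_centered = shift(J01, -mean)

p_before = pair_probabilities(J01, h, q)
p_after = pair_probabilities(J01_centered, h, q)
max_diff = max(abs(p_before[s] - p_after[s]) for s in p_before)
```

The probabilities agree up to floating-point error (`max_diff` is essentially zero), while `frobenius(J01_centered)` is smaller than `frobenius(J01)`.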

Maximum Entropy Derivation

To justify the Potts model, it is often noted that it can be derived following a maximum entropy principle: for a given set of sample covariances and frequencies, the Potts model represents the distribution with the maximal Shannon entropy of all distributions reproducing those covariances and frequencies. For a multiple sequence alignment, the sample covariances are defined as

\begin{equation}
C_{ij}(a,b) = f_{ij}(a,b) - f_i(a) f_j(b),
\end{equation}

where $f_{ij}(a,b)$ is the frequency of finding symbols $a$ and $b$ at positions $i$ and $j$ in the same sequence in the MSA, and $f_i(a)$ the frequency of finding symbol $a$ at position $i$. The Potts model is then the unique distribution $P$ that maximizes the functional

\begin{align}
F[P] = &- \sum_{a} P(a) \log P(a) \\
&+ \sum_{i<j} \sum_{x,y} \lambda_{ij}(x,y) \Big( P_{ij}(x,y) - f_{ij}(x,y) \Big) \\
&+ \sum_{i} \sum_{x} \lambda_i(x) \Big( P_i(x) - f_i(x) \Big) \\
&+ \Omega \left(1 - \sum_{a} P(a)\right).
\end{align}

The first term in the functional is the Shannon entropy of the distribution. The $\lambda$ are Lagrange multipliers to ensure $P_{ij}(x,y) = f_{ij}(x,y)$ and $P_i(x) = f_i(x)$, with $P_{ij}(x,y)$ being the marginal probability to find symbols $x,y$ at positions $i,j$. The Lagrange multiplier $\Omega$ ensures normalization. Setting the derivative of the functional with respect to each $P(a)$ to zero gives $\log P(a) = \sum_{i<j} \lambda_{ij}(a_i,a_j) + \sum_{i} \lambda_i(a_i) - 1 - \Omega$, so identifying

\begin{align}
&\lambda_{ij}(x,y) = J_{ij}(x,y) \\
&\lambda_i(x) = h_i(x) \\
&e^{1+\Omega} = Z
\end{align}

leads to the Potts model above. This procedure only gives the functional form of the Potts model, while the numerical values of the Lagrange multipliers (identified with the parameters) still have to be determined by fitting the model to the data.
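The sample statistics that enter the constraints can be computed directly from an alignment. The following is a minimal sketch on a hypothetical five-sequence MSA; the function names are illustrative, and steps that practical pipelines often add, such as sequence reweighting and pseudocounts, are deliberately left out.

```python
from collections import Counter

# Hypothetical toy alignment: 5 sequences, 2 positions.
msa = ["AC", "AC", "GT", "GT", "AT"]

def single_freq(msa, i):
    """f_i(a): fraction of sequences with symbol a at position i."""
    counts = Counter(seq[i] for seq in msa)
    return {a: c / len(msa) for a, c in counts.items()}

def pair_freq(msa, i, j):
    """f_ij(a,b): fraction of sequences with a at position i and b at position j."""
    counts = Counter((seq[i], seq[j]) for seq in msa)
    return {ab: c / len(msa) for ab, c in counts.items()}

def covariance(msa, i, j, a, b):
    """C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b)."""
    fi = single_freq(msa, i)
    fj = single_freq(msa, j)
    fij = pair_freq(msa, i, j)
    return fij.get((a, b), 0.0) - fi.get(a, 0.0) * fj.get(b, 0.0)

c_ac = covariance(msa, 0, 1, "A", "C")
```

For this alignment $f_{01}(A,C) = 2/5$, $f_0(A) = 3/5$ and $f_1(C) = 2/5$, so `c_ac` equals $0.4 - 0.24 = 0.16$.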

Direct Couplings and Indirect Correlation

The central point of DCA is to interpret the $J_{ij}$ (which can be represented as a $q\times q$ matrix if there are $q$ possible symbols) as direct couplings. If two positions are under joint evolutionary pressure (for example to maintain a structural bond), one might expect these couplings to be large because only sequences with fitting pairs of symbols should have a significant probability. On the other hand, a large correlation between two positions does not necessarily mean that the couplings are large, since large couplings between e.g. positions $i,j$ and $j,k$ might lead to large correlations between positions $i$ and $k$, mediated by position $j$. In fact, such indirect correlations have been implicated in the high false positive rate when inferring protein residue contacts using correlation measures like mutual information.
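The mediated correlation described above can be reproduced in a three-position chain. This sketch assumes a two-letter alphabet, zero fields, and a hypothetical coupling strength `K`; positions 0 and 2 have no direct coupling, yet their covariance is clearly nonzero because both are coupled to position 1.

```python
import itertools
import math

K = 1.5  # hypothetical coupling strength
favor_equal = [[K, -K], [-K, K]]       # rewards equal symbols at both positions
no_coupling = [[0.0, 0.0], [0.0, 0.0]]
J = {(0, 1): favor_equal, (1, 2): favor_equal, (0, 2): no_coupling}

def weight(seq):
    """Unnormalized probability with all fields set to zero."""
    return math.exp(sum(J[(i, j)][seq[i]][seq[j]] for (i, j) in J))

seqs = list(itertools.product(range(2), repeat=3))
Z = sum(weight(s) for s in seqs)
prob = {s: weight(s) / Z for s in seqs}

def marginal(i, a):
    return sum(p for s, p in prob.items() if s[i] == a)

def pair_marginal(i, j, a, b):
    return sum(p for s, p in prob.items() if s[i] == a and s[j] == b)

def covariance(i, j, a, b):
    return pair_marginal(i, j, a, b) - marginal(i, a) * marginal(j, b)

# Covariance between the two positions that are NOT directly coupled.
c02 = covariance(0, 2, 0, 0)
```

For this Ising-like chain the covariance has the closed form $\tanh^2(K)/4$, so it grows with the strength of the two direct couplings even though $J_{02}$ is identically zero.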

Inference

The inference of the Potts model on a multiple sequence alignment (MSA) using maximum likelihood estimation is usually computationally intractable, because one needs to calculate the normalization constant $Z$, which for sequence length $N$ and $q$ possible symbols is a sum of $q^N$ terms (for example, $20^{30} \approx 10^{39}$ terms for a small protein domain family with 30 positions). Therefore, numerous approximations and alternatives have been developed:
* mpDCA (inference based on message passing/belief propagation)
* mfDCA (inference based on a mean-field approximation)
* gaussDCA (inference based on a Gaussian approximation)
* plmDCA (inference based on pseudo-likelihoods)
* Adaptive Cluster Expansion

All of these methods lead to some form of estimate for the set of parameters $J,h$ maximizing the likelihood of the MSA. Many of them include regularization or prior terms to ensure a well-posed problem or promote a sparse solution.
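To illustrate why pseudo-likelihoods avoid the intractable sum, the sketch below evaluates the conditional probability of one symbol given all other positions, which needs only $q$ terms instead of $q^N$. It is not the plmDCA implementation: it only evaluates the objective for given parameters, omits regularization and optimization, and uses hypothetical parameter values and function names.

```python
import math

def coupling(J, i, j, a, b):
    """Look up J_ij(a,b) when only pairs with i < j are stored."""
    return J[(i, j)][a][b] if i < j else J[(j, i)][b][a]

def conditional(seq, i, J, h, q):
    """P(a_i = c | all other positions) for c = 0..q-1."""
    logits = []
    for c in range(q):
        s = h[i][c]
        for j in range(len(seq)):
            if j != i:
                s += coupling(J, i, j, c, seq[j])
        logits.append(s)
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def log_pseudolikelihood(msa, J, h, q):
    """Sum of log conditional probabilities over sequences and positions."""
    return sum(
        math.log(conditional(seq, i, J, h, q)[seq[i]])
        for seq in msa
        for i in range(len(seq))
    )

# Hypothetical toy parameters: N = 3 positions, q = 2 symbols.
q = 2
h = [[0.0, 0.5], [0.0, -0.2], [0.1, 0.0]]
J = {
    (0, 1): [[1.0, 0.0], [0.0, 1.0]],
    (0, 2): [[0.0, 0.0], [0.0, 0.0]],
    (1, 2): [[0.5, -0.5], [-0.5, 0.5]],
}
msa = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]
pll = log_pseudolikelihood(msa, J, h, q)
```

A fitting procedure would adjust `J` and `h` to increase `pll`; every term not involving position `i` cancels in the conditional, so $Z$ is never computed.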

Applications

Protein Residue Contact Prediction

A possible interpretation of large values of couplings in a model fitted to an MSA of a protein family is the existence of conserved contacts between positions (residues) in the family. Such a contact can lead to molecular coevolution, since a mutation in one of the two residues, without a compensating mutation in the other residue, is likely to disrupt protein structure and negatively affect the fitness of the protein. Residue pairs for which there is a strong selective pressure to maintain mutual compatibility are therefore expected to mutate together or not at all. This idea (which was known in the literature long before the conception of DCA) has been used to predict protein contact maps, for example by analyzing the mutual information between protein residues.

Within the framework of DCA, a score for the strength of the direct interaction between a pair of residues $i,j$ is often defined using the Frobenius norm $F_{ij}$ of the corresponding coupling matrix $J_{ij}$ and applying an ''average product correction'' (APC):

\begin{equation}
F^{APC}_{ij} = F_{ij} - \frac{F_i F_j}{F},
\end{equation}

where $F_{ij}$ has been defined above and

\begin{align}
&F_i = \frac{1}{N}\sum_{k=1}^{N} F_{ik} \\
&F = \frac{1}{N^2}\sum_{k,l=1}^{N} F_{kl}.
\end{align}

This correction term was first introduced for mutual information and is used to remove biases of specific positions to produce large $F_{ij}$. Scores that are invariant under parameter transformations that do not affect the probabilities have also been used. Sorting all residue pairs by this score results in a list whose top is strongly enriched in residue contacts when compared to the protein contact map of a homologous protein. High-quality predictions of residue contacts are valuable as prior information in protein structure prediction.
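The scoring step can be sketched as follows, assuming the couplings have already been inferred and are stored per position pair. The parameter values are hypothetical, and setting the diagonal entries $F_{ii}$ to zero is an assumption of this sketch, since the text does not define them.

```python
import math

def frobenius(Jij):
    """Frobenius norm of one q x q coupling matrix."""
    return math.sqrt(sum(x * x for row in Jij for x in row))

def apc_scores(J, n):
    """Average-product-corrected Frobenius scores for all stored pairs (i, j)."""
    F = [[0.0] * n for _ in range(n)]  # assumption: F_ii = 0
    for (i, j), Jij in J.items():
        F[i][j] = F[j][i] = frobenius(Jij)
    row_mean = [sum(F[i]) / n for i in range(n)]  # F_i
    overall = sum(row_mean) / n                   # F
    return {
        (i, j): F[i][j] - row_mean[i] * row_mean[j] / overall
        for (i, j) in J
    }

# Hypothetical couplings: one strong pair (0, 1) and two weak pairs.
J = {
    (0, 1): [[2.0, 0.0], [0.0, 2.0]],
    (0, 2): [[0.1, 0.0], [0.0, 0.1]],
    (1, 2): [[0.1, 0.0], [0.0, 0.1]],
}
scores = apc_scores(J, 3)
ranking = sorted(scores, key=scores.get, reverse=True)
```

The list `ranking` orders the residue pairs from the highest to the lowest corrected score; in this toy case the strongly coupled pair `(0, 1)` comes first.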

Inference of protein-protein interaction

DCA can be used for detecting conserved interaction between protein families and for predicting which residue pairs form contacts in a protein complex. Such predictions can be used when generating structural models for these complexes, or when inferring protein-protein interaction networks made from more than two proteins.

Modeling of fitness landscapes

DCA can be used to model fitness landscapes and to predict the effect of a mutation in the amino acid sequence of a protein on its fitness.
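One way such a prediction can be sketched is to score a mutation by the log-ratio of the model probabilities of the mutant and the reference sequence. This assumes, as an illustration rather than something stated in the text, that a higher model probability is taken as a proxy for higher fitness. The normalization constant $Z$ cancels in the ratio, so only the fields and couplings touching the mutated position matter. Parameter values and function names are hypothetical.

```python
def log_weight(seq, J, h):
    """Unnormalized log-probability: sum_{i<j} J_ij(a_i,a_j) + sum_i h_i(a_i)."""
    n = len(seq)
    total = sum(h[i][seq[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            total += J[(i, j)][seq[i]][seq[j]]
    return total

def mutation_score(reference, pos, new_symbol, J, h):
    """log P(mutant) - log P(reference); Z cancels in the difference."""
    mutant = list(reference)
    mutant[pos] = new_symbol
    return log_weight(mutant, J, h) - log_weight(reference, J, h)

# Hypothetical toy parameters: N = 3 positions, q = 2 symbols.
h = [[0.0, 0.5], [0.0, -0.2], [0.1, 0.0]]
J = {
    (0, 1): [[1.0, 0.0], [0.0, 1.0]],
    (0, 2): [[0.0, 0.0], [0.0, 0.0]],
    (1, 2): [[0.5, -0.5], [-0.5, 0.5]],
}
score = mutation_score((0, 0, 0), 0, 1, J, h)
```

Here the mutation gains 0.5 from the field at position 0 but loses the coupling of 1.0 with position 1, so `score` is $-0.5$: under the stated assumption, the mutant is predicted to be less favorable than the reference.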

External links

Online services:
* EVcouplings
* Gremlin
* DCA Webservice
* AmoAi
* ELIHKSIR
Source code:
* gplmDCA
* GaussDCA
* plmDCA
Useful applications:
* DCA-MOL: a PyMOL plugin to analyze DCA results on a structure
