In protein structure prediction, statistical potentials or knowledge-based potentials are scoring functions derived from an analysis of known protein structures in the Protein Data Bank (PDB). The original method to obtain such potentials is the ''quasi-chemical approximation'', due to Miyazawa and Jernigan. It was later followed by the ''potential of mean force'' (statistical PMF), developed by Sippl. Although the obtained scores are often considered as approximations of the free energy, and are thus referred to as ''pseudo-energies'', this physical interpretation is incorrect. Nonetheless, they are applied with success in many cases, because they frequently correlate with actual Gibbs free energy differences.

Overview

Possible features to which a pseudo-energy can be assigned include:
* interatomic distances,
* torsion angles,
* solvent exposure,
* or hydrogen bond geometry.

The classic application is, however, based on pairwise amino acid contacts or distances, thus producing statistical interatomic potentials. For pairwise amino acid contacts, a statistical potential is formulated as an interaction matrix that assigns a weight or energy value to each possible pair of standard amino acids. The energy of a particular structural model is then the combined energy of all pairwise contacts (defined as two amino acids within a certain distance of each other) in the structure. The energies are determined using statistics on amino acid contacts in a database of known protein structures (obtained from the PDB).
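The contact-based scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-letter alphabet, the matrix values, the 8 Å cutoff, the chain-separation filter and the function names are all hypothetical and are not taken from any published potential.

```python
import itertools
import math

# Hypothetical interaction matrix over a toy 3-letter alphabet.
# The values are illustrative only, not from any published potential.
CONTACT_ENERGY = {
    ("A", "A"): -0.1,
    ("A", "L"): -0.4,
    ("A", "K"): 0.2,
    ("K", "K"): 0.6,
    ("K", "L"): 0.3,
    ("L", "L"): -1.0,
}


def pair_energy(a, b):
    """Look up the symmetric matrix entry for residue types a and b."""
    return CONTACT_ENERGY[tuple(sorted((a, b)))]


def contact_energy(sequence, coords, cutoff=8.0, min_separation=2):
    """Sum pair energies over all residue pairs closer than `cutoff`.

    `coords` holds one representative point per residue (for example C-beta).
    Pairs closer than `min_separation` along the chain are skipped
    (an assumed convention, chosen here so that bonded neighbours do not count).
    """
    total = 0.0
    for i, j in itertools.combinations(range(len(sequence)), 2):
        if j - i < min_separation:
            continue
        if math.dist(coords[i], coords[j]) <= cutoff:
            total += pair_energy(sequence[i], sequence[j])
    return total


# Toy model: four residues on the corners of a 3.8 angstrom square.
sequence = "LAKL"
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
energy = contact_energy(sequence, coords)
```

In this toy model the three non-adjacent pairs (L-K, L-L and A-L) are all in contact, so the total is the sum of their three matrix entries.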

History

Initial development

Many textbooks present the statistical PMFs as proposed by Sippl as a simple consequence of the Boltzmann distribution, as applied to pairwise distances between amino acids. This is incorrect, but a useful start to introduce the construction of the potential in practice. The Boltzmann distribution applied to a specific pair of amino acids is given by:

P\left(r\right)=\frac{1}{Z}e^{-\frac{F\left(r\right)}{kT}}

where r is the distance, k is the Boltzmann constant, T is the temperature and Z is the partition function, with:

Z=\int e^{-\frac{F\left(r\right)}{kT}}dr

The quantity F(r) is the free energy assigned to the pairwise system. Simple rearrangement results in the ''inverse Boltzmann formula'', which expresses the free energy F(r) as a function of P(r):

F\left(r\right)=-kT\ln P\left(r\right)-kT\ln Z

To construct a PMF, one then introduces a so-called ''reference state'' with a corresponding distribution Q_R and partition function Z_R, and calculates the following free energy difference:

\Delta F\left(r\right)=-kT\ln\frac{P\left(r\right)}{Q_{R}\left(r\right)}-kT\ln\frac{Z}{Z_{R}}

The reference state typically results from a hypothetical system in which the specific interactions between the amino acids are absent. The second term involving Z and Z_R can be ignored, as it is a constant. In practice, P(r) is estimated from the database of known protein structures, while Q_R(r) typically results from calculations or simulations. For example, P(r) could be the conditional probability of finding the C\beta atoms of a valine and a serine at a given distance r from each other, giving rise to the free energy difference \Delta F. The total free energy difference of a protein, \Delta F_T, is then claimed to be the sum of all the pairwise free energies:

\Delta F_{T}=\sum_{i<j}\Delta F\left(r_{ij}\mid a_{i},a_{j}\right)=-kT\sum_{i<j}\ln\frac{P\left(r_{ij}\mid a_{i},a_{j}\right)}{Q_{R}\left(r_{ij}\mid a_{i},a_{j}\right)}

where the sum runs over all amino acid pairs a_i, a_j (with i<j) and r_{ij} is their corresponding distance. In many studies Q_R does not depend on the amino acid sequence.
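The practical construction described above (estimate P(r) from observed distances, obtain Q_R(r) from a reference, then take the negative log ratio) might look like the following sketch. The bin edges, the pseudocount, the value of kT and both distance lists are assumptions made up for illustration. Real potentials use far larger data sets and more careful reference states.

```python
import math

KT = 0.593  # assumed energy scale in kcal/mol (roughly room temperature)


def histogram(distances, bin_edges, pseudocount=1.0):
    """Normalised histogram; the pseudocount keeps every bin non-zero."""
    counts = [pseudocount] * (len(bin_edges) - 1)
    for d in distances:
        for b in range(len(counts)):
            if bin_edges[b] <= d < bin_edges[b + 1]:
                counts[b] += 1.0
                break
    total = sum(counts)
    return [c / total for c in counts]


def statistical_pmf(observed, reference, bin_edges, kt=KT):
    """Per-bin Delta F = -kT ln(P / Q_R); the constant Z/Z_R term is dropped."""
    p = histogram(observed, bin_edges)
    q = histogram(reference, bin_edges)
    return [-kt * math.log(pi / qi) for pi, qi in zip(p, q)]


bin_edges = [4.0, 6.0, 8.0, 10.0]
# Hypothetical distances (in angstroms) for one residue-type pair ...
observed = [4.5, 5.0, 5.5, 5.8, 6.5, 9.0]
# ... and for the reference state.
reference = [4.5, 6.5, 7.0, 7.5, 8.5, 9.0, 9.5]
delta_f = statistical_pmf(observed, reference, bin_edges)
```

Bins where the pair is over-represented relative to the reference receive a negative pseudo-energy, and under-represented bins a positive one. Summing such terms over all residue pairs of a model gives the total \Delta F_T.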

Conceptual issues

Intuitively, it is clear that a low value for \Delta F_T indicates that the set of distances in a structure is more likely in proteins than in the reference state. However, the physical meaning of these statistical PMFs has been widely disputed since their introduction. The main issues are:
# The wrong interpretation of this "potential" as a true, physically valid potential of mean force;
# The nature of the so-called ''reference state'' and its optimal formulation;
# The validity of generalizations beyond pairwise distances.

Controversial analogy

In response to the issue regarding the physical validity, the first justification of statistical PMFs was attempted by Sippl. It was based on an analogy with the statistical physics of liquids. For liquids, the potential of mean force is related to the radial distribution function g(r), which is given by (Chandler D (1987) Introduction to Modern Statistical Mechanics. New York: Oxford University Press, USA):

g(r)=\frac{P(r)}{Q_{R}(r)}

where P(r) and Q_R(r) are the respective probabilities of finding two particles at a distance r from each other in the liquid and in the reference state. For liquids, the reference state is clearly defined; it corresponds to the ideal gas, consisting of non-interacting particles. The two-particle potential of mean force W(r) is related to g(r) by:

W(r)=-kT\log g(r)=-kT\log\frac{P(r)}{Q_{R}(r)}

According to the reversible work theorem, the two-particle potential of mean force W(r) is the reversible work required to bring two particles in the liquid from infinite separation to a distance r from each other. Sippl justified the use of statistical PMFs, a few years after he introduced them for use in protein structure prediction, by appealing to the analogy with the reversible work theorem for liquids. For liquids, g(r) can be experimentally measured using small angle X-ray scattering; for proteins, P(r) is obtained from the set of known protein structures, as explained in the previous section. However, as Ben-Naim wrote in a publication on the subject:

..the quantities, referred to as "statistical potentials," "structure based potentials," or "pair potentials of mean force", as derived from the protein data bank (PDB), are neither "potentials" nor "potentials of mean force," in the ordinary sense as used in the literature on liquids and solutions.

Moreover, this analogy does not solve the issue of how to specify a suitable ''reference state'' for proteins.
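As a small numerical illustration of the liquid-state relation between W(r) and g(r) used in the analogy above, suppose that at some distance two particles are found twice as often as in the ideal gas, so that g(r) = 2. The temperature of 298 K and the molar value of the Boltzmann constant are assumptions chosen for the example, and \log denotes the natural logarithm.

```latex
\begin{aligned}
kT &\approx 0.001987\ \mathrm{kcal\,mol^{-1}\,K^{-1}} \times 298\ \mathrm{K}
    \approx 0.592\ \mathrm{kcal\,mol^{-1}} \\
W(r) &= -kT\log g(r) = -0.592 \times \log 2
    \approx -0.41\ \mathrm{kcal\,mol^{-1}}
\end{aligned}
```

A value g(r) > 1 therefore corresponds to a negative (favourable) potential of mean force, and g(r) < 1 to a positive one. The dispute summarised above is over whether this reading carries over from liquids to database statistics on proteins.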

Machine learning

In the mid-2000s, authors started to combine multiple statistical potentials, derived from different structural features, into ''composite scores''. For that purpose, they used machine learning techniques, such as support vector machines (SVMs). Probabilistic neural networks (PNNs) have also been applied for the training of a position-specific distance-dependent statistical potential. In 2016, the DeepMind artificial intelligence research laboratory started to apply deep learning techniques to the development of a torsion- and distance-dependent statistical potential. The resulting method, named AlphaFold, won the 13th Critical Assessment of Techniques for Protein Structure Prediction (CASP) by predicting the most accurate structure for 25 out of 43 free modelling domains.

Explanation

Bayesian probability

Baker and co-workers justified statistical PMFs from a Bayesian point of view and used these insights in the construction of the coarse grained ROSETTA energy function. According to Bayesian probability calculus, the conditional probability P(X\mid A) of a structure X, given the amino acid sequence A, can be written as:

P\left(X\mid A\right)=\frac{P\left(A\mid X\right)P\left(X\right)}{P\left(A\right)}\propto P\left(A\mid X\right)P\left(X\right)

P(X\mid A) is proportional to the product of the likelihood P\left(A\mid X\right) and the prior P\left(X\right). By assuming that the likelihood can be approximated as a product of pairwise probabilities, and applying Bayes' theorem, the likelihood can be written as:

P\left(A\mid X\right)\approx\prod_{i<j}P\left(a_{i},a_{j}\mid r_{ij}\right)\propto\prod_{i<j}\frac{P\left(r_{ij}\mid a_{i},a_{j}\right)}{P\left(r_{ij}\right)}

where the product runs over all amino acid pairs a_i, a_j (with i<j), and r_{ij} is the distance between amino acids i and j. Obviously, the negative of the logarithm of the expression has the same functional form as the classic pairwise distance statistical PMFs, with the denominator playing the role of the reference state. This explanation has two shortcomings: it relies on the unfounded assumption that the likelihood can be expressed as a product of pairwise probabilities, and it is purely ''qualitative''.
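The correspondence claimed above can be written out explicitly. The sketch below assumes the pairwise factorisation of the likelihood, P(A\mid X)\propto\prod_{i<j}P(r_{ij}\mid a_{i},a_{j})/P(r_{ij}), and a sequence-independent reference distribution Q_R. C is a constant that depends on the sequence but not on the structure.

```latex
\begin{aligned}
-\ln P\left(A\mid X\right)
  &\approx -\sum_{i<j}\ln\frac{P\left(r_{ij}\mid a_{i},a_{j}\right)}{P\left(r_{ij}\right)} + C \\
\Delta F_{T}
  &= -kT\sum_{i<j}\ln\frac{P\left(r_{ij}\mid a_{i},a_{j}\right)}{Q_{R}\left(r_{ij}\right)}
\end{aligned}
```

Up to the factor kT and the constant C, the two sums agree term by term once P(r_{ij}) is identified with Q_R(r_{ij}). This is the sense in which the denominator plays the role of the reference state.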

Probability kinematics

Hamelryck and co-workers later gave a ''quantitative'' explanation for the statistical potentials, according to which they approximate a form of probabilistic reasoning due to Richard Jeffrey and named probability kinematics. This variant of Bayesian thinking (sometimes called "Jeffrey conditioning") allows updating a prior distribution based on new information on the probabilities of the elements of a partition on the support of the prior. From this point of view, (i) it is not necessary to assume that the database of protein structures, used to build the potentials, follows a Boltzmann distribution, (ii) statistical potentials generalize readily beyond pairwise distances, and (iii) the ''reference ratio'' is determined by the prior distribution.

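Jeffrey conditioning on a finite outcome space can be illustrated with a short sketch. The function name, the four toy "conformations" and all probabilities below are hypothetical. The update rule itself is the standard one: each outcome is rescaled by the ratio of the new to the prior probability of its partition cell.

```python
def jeffrey_update(prior, partition, new_cell_probs):
    """Jeffrey conditioning on a finite outcome space.

    prior:          dict outcome -> prior probability
    partition:      dict outcome -> label of the partition cell containing it
    new_cell_probs: dict cell label -> updated probability of that cell
    """
    # Probability that the prior itself assigns to each cell.
    prior_cell = {}
    for outcome, p in prior.items():
        cell = partition[outcome]
        prior_cell[cell] = prior_cell.get(cell, 0.0) + p
    # Rescale every outcome by (new cell probability) / (prior cell probability).
    return {
        outcome: p * new_cell_probs[partition[outcome]] / prior_cell[partition[outcome]]
        for outcome, p in prior.items()
    }


# Hypothetical toy example: four conformations grouped into two distance classes.
prior = {"x1": 0.4, "x2": 0.2, "x3": 0.3, "x4": 0.1}
partition = {"x1": "near", "x2": "near", "x3": "far", "x4": "far"}
posterior = jeffrey_update(prior, partition, {"near": 0.9, "far": 0.1})
```

After the update the cells carry the new probabilities, while the relative probabilities of outcomes inside each cell are unchanged. The rescaling factor has the same form as a ratio of a target distribution to a prior-implied distribution.
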
Reference ratio

Expressions that resemble statistical PMFs naturally result from the application of probability theory to solve a fundamental problem that arises in protein structure prediction: how to improve an imperfect probability distribution Q(X) over a first variable X using a probability distribution P(Y) over a second variable Y, with Y=f(X). Typically, X and Y are fine and coarse grained variables, respectively. For example, Q(X) could concern the local structure of the protein, while P(Y) could concern the pairwise distances between the amino acids. In that case, X could for example be a vector of dihedral angles that specifies all atom positions (assuming ideal bond lengths and angles). In order to combine the two distributions, such that the local structure will be distributed according to Q(X), while the pairwise distances will be distributed according to P(Y), the following expression is needed:

P(X,Y)=\frac{P(Y)}{Q(Y)}Q(X)

where Q(Y) is the distribution over Y implied by Q(X). The ratio in the expression corresponds to the PMF. Typically, Q(X) is brought in by sampling (typically from a fragment library), and not explicitly evaluated; the ratio, which in contrast is explicitly evaluated, corresponds to Sippl's PMF. This explanation is quantitative, and allows the generalization of statistical PMFs from pairwise distances to arbitrary coarse grained variables. It also provides a rigorous definition of the reference state, which is implied by Q(X). Conventional applications of pairwise distance statistical PMFs usually lack two necessary features to make them fully rigorous: the use of a proper probability distribution over pairwise distances in proteins, and the recognition that the reference state is rigorously defined by Q(X).
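The division of labour described above, with Q(X) brought in by sampling and only the ratio P(Y)/Q(Y) evaluated explicitly, can be sketched with a toy discrete example. Everything here is an assumption for illustration: the three states, their probabilities, the coarse-graining map `f`, and the use of importance weights in place of a real sampling scheme over a fragment library.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical fine-grained states X, proposal Q(X), and coarse-graining Y = f(X).
Q_X = {"a": 0.5, "b": 0.3, "c": 0.2}
f = {"a": "short", "b": "short", "c": "long"}
P_Y = {"short": 0.4, "long": 0.6}  # target distribution over Y

# Q(Y): the distribution over Y implied by Q(X).
Q_Y = Counter()
for x, q in Q_X.items():
    Q_Y[f[x]] += q

# Draw X from Q(X) (standing in for a fragment library) and attach the
# explicitly evaluated ratio P(Y)/Q(Y) as an importance weight.
samples = random.choices(list(Q_X), weights=list(Q_X.values()), k=100_000)
weights = [P_Y[f[x]] / Q_Y[f[x]] for x in samples]

# Weighted frequencies estimate the combined distribution P(Y)/Q(Y) * Q(X).
total = sum(weights)
estimate = Counter()
for x, w in zip(samples, weights):
    estimate[x] += w / total

# Exact values of the combined distribution, for comparison.
exact = {x: P_Y[f[x]] / Q_Y[f[x]] * q for x, q in Q_X.items()}
```

In this sketch Q(X) is never evaluated for the sampled states. Only the ratio is, which mirrors the role the text assigns to the PMF. The weighted sample frequencies should approach the exact values as the number of samples grows.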

Applications

Statistical potentials are used as energy functions in the assessment of an ensemble of structural models produced by homology modeling or protein threading. Many differently parameterized statistical potentials have been shown to successfully identify the native state structure from an ensemble of decoy or non-native structures. Statistical potentials are not only used for protein structure prediction, but also for modelling the protein folding pathway.
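One common way to summarise how well a potential separates a native structure from decoys is to report the rank and the Z-score of the native pseudo-energy within the decoy set. The sketch below assumes the pseudo-energies have already been computed by some potential. The numbers and function names are hypothetical.

```python
import statistics


def native_z_score(native_energy, decoy_energies):
    """Z-score of the native pseudo-energy relative to the decoy distribution."""
    mean = statistics.mean(decoy_energies)
    spread = statistics.stdev(decoy_energies)
    return (native_energy - mean) / spread


def rank_of_native(native_energy, decoy_energies):
    """1-based rank of the native structure when sorting by ascending energy."""
    return 1 + sum(e < native_energy for e in decoy_energies)


# Hypothetical pseudo-energies for one native structure and five decoys.
native = -12.0
decoys = [-8.0, -6.5, -9.0, -7.5, -5.0]
z = native_z_score(native, decoys)
rank = rank_of_native(native, decoys)
```

A rank of 1 together with a strongly negative Z-score indicates that the potential scores the native structure well below the decoys.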

