AlphaFold 2

AlphaFold is an artificial intelligence (AI) program developed by DeepMind, a subsidiary of Alphabet, which performs predictions of protein structure. The program is designed as a deep learning system.

AlphaFold has had two major versions. A team of researchers using AlphaFold 1 (2018) placed first in the overall rankings of the 13th Critical Assessment of protein Structure Prediction (CASP) in December 2018. The program was particularly successful at predicting the most accurate structure for targets rated as the most difficult by the competition organisers, where no existing template structures were available from proteins with a partially similar sequence. A team using AlphaFold 2 (2020) repeated the first-place finish in the CASP competition in November 2020, achieving a level of accuracy much higher than any other group. It scored above 90 for around two-thirds of the proteins in CASP's global distance test (GDT), a test that measures how similar a computationally predicted structure is to the experimentally determined structure, with 100 being a complete match within the distance cutoff used for calculating GDT (Robert F. Service, "‘The game has changed.’ AI triumphs at solving protein structures", ''Science'', 30 November 2020).

AlphaFold 2's results at CASP were described as "astounding" and "transformational." Some researchers noted that the accuracy is not high enough for a third of its predictions, and that it does not reveal the mechanism or rules of protein folding, so the protein folding problem cannot be considered solved. Nevertheless, there has been widespread respect for the technical achievement. On 15 July 2021 the AlphaFold 2 paper was published in ''Nature'' as an advance access publication, alongside open source software and a searchable database of species proteomes.
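To make the GDT score mentioned above more concrete, the following is a minimal Python sketch of a GDT_TS-style calculation. It is a simplification under stated assumptions: the two structures are taken to be already superimposed (the real CASP procedure searches over many superpositions), the coordinates are alpha-carbon positions in ångströms, and the cutoffs of 1, 2, 4 and 8 Å are the ones conventionally used for GDT_TS. The function name `gdt_ts` is chosen here for illustration only.

```python
import math


def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS-style score in the range 0..100.

    Assumes both coordinate lists are already superimposed and list the
    same residues in the same order (an assumption of this sketch).
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("coordinate lists must be non-empty and equal in length")

    # Per-residue distance between predicted and experimental positions.
    distances = [math.dist(p, q) for p, q in zip(predicted, experimental)]

    # Fraction of residues that fall within each distance cutoff.
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]

    # The score is the mean of those fractions, expressed as a percentage.
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]
    print(gdt_ts(predicted, experimental))
```

A prediction identical to the experimental structure scores 100 under this sketch, and a single badly placed residue lowers the score only in proportion to its share of the chain.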


Protein folding problem

Proteins consist of chains of amino acids which spontaneously fold, in a process called protein folding, to form the three-dimensional (3-D) structures of the proteins. The 3-D structure is crucial to the biological function of the protein. However, understanding how the amino acid sequence determines the 3-D structure is highly challenging, and this is called the "protein folding problem". The problem involves understanding the thermodynamics of the interatomic forces that determine the folded stable structure, the mechanism and pathway through which a protein can reach its final folded state with extreme rapidity, and how the native structure of a protein can be predicted from its amino acid sequence.

Protein structures are currently determined experimentally by means of techniques such as X-ray crystallography, cryo-electron microscopy and nuclear magnetic resonance, which are both expensive and time-consuming. Such efforts have identified the structures of about 170,000 proteins over the last 60 years, while there are over 200 million known proteins across all life forms. If it were possible to predict protein structure from the amino acid sequence alone, it would greatly help to advance scientific research. However, Levinthal's paradox shows that while a protein can fold in milliseconds, the time it would take to enumerate all possible structures at random to find the true native structure is longer than the age of the known universe, which has made predicting protein structures a grand challenge in biology.

Over the years, researchers have applied numerous computational methods to the problem of protein structure prediction, but their accuracy has not been close to that of experimental techniques except for small simple proteins, thus limiting their value. CASP, which was launched in 1994 to challenge the scientific community to produce their best protein structure predictions, found that by 2016 GDT scores of only about 40 out of 100 could be achieved for the most difficult proteins. AlphaFold started competing in the 2018 CASP using an artificial intelligence (AI) deep learning technique.
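The scale of Levinthal's paradox described above can be illustrated with a short back-of-the-envelope calculation. The numbers in this sketch are illustrative assumptions, not figures from the sources cited here: a chain of 100 residues, 3 possible conformations per residue, and a sampling rate of 10^13 conformations per second. The age of the universe is taken as roughly 1.38 × 10^10 years.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate value


def exhaustive_search_years(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Years needed to visit every conformation once.

    All default parameters are illustrative assumptions for this sketch.
    """
    conformations = states_per_residue ** n_residues  # exact integer
    seconds = conformations / samples_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = exhaustive_search_years(100)
    print(f"{years:.2e} years to enumerate every conformation")
    print(f"{years / AGE_OF_UNIVERSE_YEARS:.2e} times the age of the universe")
```

Under these assumptions the exhaustive search takes on the order of 10^27 years, which is why random enumeration cannot be how proteins fold or how structures can be predicted.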


Algorithm

DeepMind is known to have trained the program on over 170,000 proteins from a public repository of protein sequences and structures. The program uses a form of attention network, a deep learning technique that focuses on having the AI algorithm identify parts of a larger problem, then piece them together to obtain the overall solution. The overall training was conducted on processing power between 100 and 200 GPUs. Training the system on this hardware took "a few weeks", after which the program would take "a matter of days" to converge for each structure.


AlphaFold 1, 2018

AlphaFold 1 (2018) was built on work developed by various teams in the 2010s, work that looked at the large databanks of related DNA sequences now available from many different organisms (most without known 3D structures), to try to find changes at different residues that appeared to be correlated, even though the residues were not consecutive in the main chain. Such correlations suggest that the residues may be close to each other physically, even though not close in the sequence, allowing a contact map to be estimated. Building on recent work prior to 2018, AlphaFold 1 extended this to estimate a probability distribution for just ''how'' close the residues might be, turning the contact map into a likely distance map. It also used more advanced learning methods than previously to develop the inference. Combining a statistical potential based on this probability distribution with the calculated local free-energy of the configuration, the team was then able to use gradient descent to reach a solution that best fitted both.

More technically, Torrisi ''et al.'' summarised in 2019 the approach of AlphaFold version 1 as follows:

"Central to AlphaFold is a distance map predictor implemented as a very deep residual neural networks with 220 residual blocks processing a representation of dimensionality 64×64×128 – corresponding to input features calculated from two 64 amino acid fragments. Each residual block has three layers including a 3×3 dilated convolutional layer – the blocks cycle through dilation of values 1, 2, 4, and 8. In total the model has 21 million parameters. The network uses a combination of 1D and 2D inputs, including evolutionary profiles from different sources and co-evolution features. Alongside a distance map in the form of a very finely-grained histogram of distances, AlphaFold predicts Φ and Ψ angles for each residue which are used to create the initial predicted 3D structure. The AlphaFold authors concluded that the depth of the model, its large crop size, the large training set of roughly 29,000 proteins, modern Deep Learning techniques, and the richness of information from the predicted histogram of distances helped AlphaFold achieve a high contact map prediction precision."
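The difference between a contact map and a distance map, as described above, can be sketched in a few lines of Python. This is an illustration only, not AlphaFold's code: the function name, the bin layout and the 8 Å contact threshold (a common convention in contact prediction) are assumptions of this sketch. Given a predicted histogram of distances for one residue pair, it derives both a single contact probability and an expected distance.

```python
def summarise_distance_histogram(bin_edges, bin_probs, contact_threshold=8.0):
    """Reduce a per-pair distance histogram to two summary numbers.

    bin_edges: ascending distances in ångströms, one more entry than bin_probs.
    bin_probs: predicted probability mass for each bin.
    Returns (contact_probability, expected_distance).
    """
    if len(bin_edges) != len(bin_probs) + 1:
        raise ValueError("need exactly one more edge than probabilities")
    total = sum(bin_probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")

    # Normalise so the histogram is a proper probability distribution.
    probs = [p / total for p in bin_probs]

    # A contact map keeps only this number: the mass below the threshold.
    contact_probability = sum(
        p for upper, p in zip(bin_edges[1:], probs) if upper <= contact_threshold
    )

    # A distance map keeps the whole histogram; its mean is one summary of it.
    expected_distance = sum(
        p * (lower + upper) / 2.0
        for lower, upper, p in zip(bin_edges, bin_edges[1:], probs)
    )
    return contact_probability, expected_distance


if __name__ == "__main__":
    edges = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
    probs = [0.10, 0.20, 0.30, 0.25, 0.15]
    print(summarise_distance_histogram(edges, probs))
```

Two residue pairs can share the same contact probability while having quite different histograms, which is the extra information a distance map carries over a contact map.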


AlphaFold 2, 2020

The 2020 version of the program (AlphaFold 2) is significantly different from the original version that won CASP13 in 2018, according to the team at DeepMind (Jeremy Kahn, "Lessons from DeepMind's breakthrough in protein-folding A.I.", ''Fortune'', 1 December 2020).

The DeepMind team had identified that its previous approach, combining local physics with a guide potential derived from pattern recognition, had a tendency to over-account for interactions between residues that were nearby in the sequence compared to interactions between residues further apart along the chain. As a result, AlphaFold 1 had a tendency to prefer models with slightly more secondary structure (alpha helices and beta sheets) than was the case in reality, a form of overfitting (John Jumper et al., conference abstract, December 2020).

The software design used in AlphaFold 1 contained a number of modules, each trained separately, that were used to produce the guide potential that was then combined with the physics-based energy potential. AlphaFold 2 replaced this with a system of sub-networks coupled together into a single differentiable end-to-end model, based entirely on pattern recognition, which was trained as a single integrated structure (see the block diagram; also John Jumper ''et al.'', AlphaFold 2 presentation, 1 December 2020, slide 10).
Local physics, in the form of energy refinement based on the AMBER model, is applied only as a final refinement step once the neural network prediction has converged, and only slightly adjusts the predicted structure.

A key part of the 2020 system is a pair of modules, believed to be based on a transformer design, which progressively refine a vector of information for each relationship (or "edge" in graph-theory terminology) between one amino acid residue of the protein and another (these relationships are represented by the array shown in green in the block diagram), and between each amino acid position and each different sequence in the input sequence alignment (these relationships are represented by the array shown in red). Internally these refinement transformations contain layers that have the effect of bringing relevant data together and filtering out irrelevant data (the "attention mechanism") for these relationships, in a context-dependent way, learnt from training data. These transformations are iterated, the updated information output by one step becoming the input of the next, with the sharpened residue/residue information feeding into the update of the residue/sequence information, and then the improved residue/sequence information feeding into the update of the residue/residue information.

As the iteration progresses, according to one report, the "attention algorithm ... mimics the way a person might assemble a jigsaw puzzle: first connecting pieces in small clumps – in this case clusters of amino acids – and then searching for ways to join the clumps in a larger whole." The output of these iterations then informs the final structure prediction module, which also uses transformers, and is itself then iterated. In an example presented by DeepMind, the structure prediction module achieved a correct topology for the target protein on its first iteration, scored as having a GDT_TS of 78, but with a large number (90%) of stereochemical violations, i.e. unphysical bond angles or lengths. With subsequent iterations the number of stereochemical violations fell. By the third iteration the GDT_TS of the prediction was approaching 90, and by the eighth iteration the number of stereochemical violations was approaching zero (John Jumper ''et al.'', AlphaFold 2 presentation, 1 December 2020, slides 12 to 20).

The AlphaFold team stated in November 2020 that they believe AlphaFold can be further developed, with room for further improvements in accuracy. The training data was originally restricted to single peptide chains. However, the October 2021 update, named AlphaFold-Multimer, included protein complexes in its training data. DeepMind stated this update succeeded about 70% of the time at accurately predicting protein-protein interactions.
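The "attention mechanism" referred to above, which brings relevant data together and filters out irrelevant data, can be illustrated with generic scaled dot-product attention. The sketch below is the textbook form of the technique written in plain Python; it is not AlphaFold 2's architecture, and the function names and toy vectors are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Generic scaled dot-product attention for a single query vector.

    Each key is scored against the query; the output is the average of the
    value vectors, weighted by those (softmax-normalised) scores.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return output, weights


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
    values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    output, weights = attend(query, keys, values)
    print(weights)
    print(output)
```

In this toy example almost all of the weight goes to the first key, the one most similar to the query, so the output is dominated by the first value vector. In a trained network the queries, keys and values are learnt projections of the data, which is what makes the filtering context-dependent.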


Competitions


CASP13

In December 2018, DeepMind's AlphaFold placed first in the overall rankings of the 13th Critical Assessment of Techniques for Protein Structure Prediction (CASP). The program was particularly successful at predicting the most accurate structure for targets rated as the most difficult by the competition organisers, where no existing template structures were available from proteins with a partially similar sequence. AlphaFold gave the best prediction for 25 out of 43 protein targets in this class, achieving a median score of 58.9 on CASP's global distance test (GDT), ahead of 52.5 and 52.4 by the two next best-placed teams, who were also using deep learning to estimate contact distances. Overall, across all targets, the program achieved a GDT score of 68.5.

In January 2020, implementations and illustrative code of AlphaFold 1 were released open-source on GitHub, but, as stated in the "Read Me" file on that website: "This code can't be used to predict structure of an arbitrary protein sequence. It can be used to predict structure only on the CASP13 dataset (links below). The feature generation code is tightly coupled to our internal infrastructure as well as external tools, hence we are unable to open-source it." Therefore, in essence, the code deposited is not suitable for general use but only for the CASP13 proteins. As of 5 March 2021, the company had not announced plans to make its code publicly available.


CASP14

In November 2020, DeepMind's new version, AlphaFold 2, won CASP14. Overall, AlphaFold 2 made the best prediction for 88 out of the 97 targets. On the competition's preferred global distance test (GDT) measure of accuracy, the program achieved a median score of 92.4 (out of 100), meaning that more than half of its predictions were scored at better than 92.4% for having their atoms in more-or-less the right place, a level of accuracy reported to be comparable to experimental techniques like X-ray crystallography. In 2018 AlphaFold 1 had reached this level of accuracy in only two of its predictions. 88% of predictions in the 2020 competition had a GDT_TS score of more than 80. On the group of targets classed as the most difficult, AlphaFold 2 achieved a median score of 87.

Measured by the root-mean-square deviation (RMSD) of the placement of the alpha-carbon atoms of the protein backbone chain, which tends to be dominated by the performance of the worst-fitted outliers, 88% of AlphaFold 2's predictions had an RMS deviation of less than 4 Å for the set of overlapped C-alpha atoms. 76% of predictions achieved better than 3 Å, and 46% had a C-alpha atom RMS accuracy better than 2 Å, with a median RMS deviation in its predictions of 2.1 Å for a set of overlapped C-alpha atoms (Mohammed AlQuraishi, "CASP14 scores just came out and they’re astounding", Twitter, 30 November 2020). AlphaFold 2 also achieved an accuracy in modelling surface side chains described as "really really extraordinary".

To additionally verify AlphaFold 2, the conference organisers approached four leading experimental groups for structures they were finding particularly challenging and had been unable to determine. In all four cases the three-dimensional models produced by AlphaFold 2 were sufficiently accurate to determine structures of these proteins by molecular replacement. These included target T1100 (Af1503), a small membrane protein studied by experimentalists for ten years.

Of the three structures that AlphaFold 2 had the least success in predicting, two had been obtained by protein NMR methods, which define protein structure directly in aqueous solution, whereas AlphaFold was mostly trained on protein structures in crystals. The third exists in nature as a multidomain complex consisting of 52 identical copies of the same domain, a situation AlphaFold was not programmed to consider. For all targets with a single domain, excluding only one very large protein and the two structures determined by NMR, AlphaFold 2 achieved a GDT_TS score of over 80.
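The remark above that the RMS deviation "tends to be dominated by the performance of the worst-fitted outliers" can be checked with a small sketch. The code below computes a plain RMSD over paired alpha-carbon coordinates; it assumes the two structures are already superimposed, and the function name and toy coordinates are illustrative assumptions rather than CASP's evaluation code.

```python
import math


def ca_rmsd(predicted, experimental):
    """Root-mean-square deviation over paired C-alpha coordinates (in Å).

    Assumes the structures are already superimposed and list the same
    residues in the same order (an assumption of this sketch).
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = [math.dist(p, q) ** 2 for p, q in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Ten residues spaced 3.8 Å apart along the x axis (toy coordinates).
    experimental = [(3.8 * i, 0.0, 0.0) for i in range(10)]

    # Nine residues are off by 1 Å; the last one is off by 10 Å.
    offsets = [1.0] * 9 + [10.0]
    predicted = [(x, dy, 0.0) for (x, _, _), dy in zip(experimental, offsets)]

    print(f"RMSD: {ca_rmsd(predicted, experimental):.2f} Å")
    print(f"Median deviation: {sorted(offsets)[len(offsets) // 2]:.2f} Å")
```

In this toy case a single residue displaced by 10 Å raises the RMSD to about 3.3 Å even though nine of the ten residues are within 1 Å, because the deviations are squared before averaging.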


Responses

AlphaFold 2 scoring more than 90 in
CASP Critical Assessment of Structure Prediction (CASP), sometimes called Critical Assessment of Protein Structure Prediction, is a community-wide, worldwide experiment for protein structure prediction taking place every two years since 1994. CASP prov ...
's global distance test (GDT) is considered a significant achievement in
computational biology Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has fo ...
and great progress towards a decades-old grand challenge of biology.
Nobel Prize The Nobel Prizes ( ; sv, Nobelpriset ; no, Nobelprisen ) are five separate prizes that, according to Alfred Nobel's will of 1895, are awarded to "those who, during the preceding year, have conferred the greatest benefit to humankind." Alfr ...
winner and
structural biologist Structural biology is a field that is many centuries old which, and as defined by the Journal of Structural Biology, deals with structural analysis of living material (formed, composed of, and/or maintained and refined by living cells) at every le ...
Venki Ramakrishnan Venkatraman Ramakrishnan (born 1952) is an Indian-born British and American structural biologist who shared the 2009 Nobel Prize in Chemistry with Thomas A. Steitz and Ada Yonath, "for studies of the structure and function of the ribosome". ...
called the result "a stunning advance on the protein folding problem", adding that "It has occurred decades before many people in the field would have predicted. It will be exciting to see the many ways in which it will fundamentally change biological research." Propelled by press releases from CASP and DeepMind,Artificial intelligence solution to a 50-year-old science challenge could ‘revolutionise’ medical research
(press release),
CASP Critical Assessment of Structure Prediction (CASP), sometimes called Critical Assessment of Protein Structure Prediction, is a community-wide, worldwide experiment for protein structure prediction taking place every two years since 1994. CASP prov ...
organising committee, 30 November 2020
AlphaFold 2's success received wide media attention. As well as news pieces in the specialist science press, such as ''
Nature Nature, in the broadest sense, is the physics, physical world or universe. "Nature" can refer to the phenomenon, phenomena of the physical world, and also to life in general. The study of nature is a large, if not the only, part of science. ...
'', ''
Science Science is a systematic endeavor that builds and organizes knowledge in the form of testable explanations and predictions about the universe. Science may be as old as the human species, and some of the earliest archeological evidence for ...
'', ''
MIT Technology Review ''MIT Technology Review'' is a bimonthly magazine wholly owned by the Massachusetts Institute of Technology, and editorially independent of the university. It was founded in 1899 as ''The Technology Review'', and was re-launched without "The" in ...
'', and ''
New Scientist ''New Scientist'' is a magazine covering all aspects of science and technology. Based in London, it publishes weekly English-language editions in the United Kingdom, the United States and Australia. An editorially separate organisation publishe ...
'', the story was widely covered by major national newspapers, as well as general news-services and weekly publications, such as ''
Fortune Fortune may refer to: General * Fortuna or Fortune, the Roman goddess of luck * Luck * Wealth * Fortune, a prediction made in fortune-telling * Fortune, in a fortune cookie Arts and entertainment Film and television * ''The Fortune'' (1931 film) ...
'', ''
The Economist ''The Economist'' is a British weekly newspaper printed in demitab format and published digitally. It focuses on current affairs, international business, politics, technology, and culture. Based in London, the newspaper is owned by The Econo ...
'',
Bloomberg Bloomberg may refer to: People * Daniel J. Bloomberg (1905–1984), audio engineer * Georgina Bloomberg (born 1983), professional equestrian * Michael Bloomberg (born 1942), American businessman and founder of Bloomberg L.P.; politician and ma ...
, ''
Der Spiegel ''Der Spiegel'' (, lit. ''"The Mirror"'') is a German weekly news magazine published in Hamburg. With a weekly circulation of 695,100 copies, it was the largest such publication in Europe in 2011. It was founded in 1947 by John Seymour Chaloner ...
'',Julia Merlot
Forscher hoffen auf Durchbruch für die Medikamentenforschung
(Researchers hope for a breakthrough for drug research), ''
Der Spiegel ''Der Spiegel'' (, lit. ''"The Mirror"'') is a German weekly news magazine published in Hamburg. With a weekly circulation of 695,100 copies, it was the largest such publication in Europe in 2011. It was founded in 1947 by John Seymour Chaloner ...
'', 2 December 2020
and ''
The Spectator ''The Spectator'' is a weekly British magazine on politics, culture, and current affairs. It was first published in July 1828, making it the oldest surviving weekly magazine in the world. It is owned by Frederick Barclay, who also owns ''The ...
''. In London ''
The Times ''The Times'' is a British daily national newspaper based in London. It began in 1785 under the title ''The Daily Universal Register'', adopting its current name on 1 January 1788. ''The Times'' and its sister paper ''The Sunday Times'' (fou ...
'' made the story its front-page photo lead, with two further pages of inside coverage and an editorial. A frequent theme was that ability to predict protein structures accurately based on the constituent amino acid sequence is expected to have a wide variety of benefits in the life sciences space including accelerating advanced drug discovery and enabling better understanding of diseases. Writing about the event, the ''
MIT Technology Review ''MIT Technology Review'' is a bimonthly magazine wholly owned by the Massachusetts Institute of Technology, and editorially independent of the university. It was founded in 1899 as ''The Technology Review'', and was re-launched without "The" in ...
'' noted that the AI had "solved a fifty-year old grand challenge of biology." The same article went on to note that the AI algorithm could "predict the shape of proteins to within the width of an atom." As summed up by ''
Der Spiegel'', reservations about this coverage have focussed on two main areas: "There is still a lot to be done" and "We don't even know how they do it". (Christian Stöcker, "Google greift nach dem Leben selbst" [Google is reaching for life itself], ''Der Spiegel'', 6 December 2020)
Although a 30-minute presentation about AlphaFold 2 was given on the second day of the CASP conference (December 1) by project leader John Jumper, it has been described as "exceedingly high-level, heavy on ideas and insinuations, but almost entirely devoid of detail". Unlike other research groups presenting at CASP14, DeepMind's presentation was not recorded and is not publicly available. DeepMind is expected to publish a scientific paper giving an account of AlphaFold 2 in the proceedings volume of the CASP conference; but it is not known whether it will go beyond what was said in the presentation. Speaking to ''
El País
'', researcher
Alfonso Valencia said "The most important thing that this advance leaves us is knowing that this problem has a solution, that it is possible to solve it... We only know the result. Google does not provide the software and this is the frustrating part of the achievement because it will not directly benefit science." (Nuño Dominguez, "La inteligencia artificial arrasa en uno de los problemas más importantes de la biología" [Artificial intelligence takes out one of the most important problems in biology], ''El País'', 2 December 2020)
Nevertheless, whatever Google and DeepMind do release may help other teams develop similar AI systems, an "indirect" benefit. In late 2019 DeepMind released much of the code of the first version of AlphaFold as open source, but only when work was well underway on the much more radical AlphaFold 2. Another option would be to make AlphaFold 2 structure prediction available as an online black-box subscription service. Convergence for a single sequence has been estimated to require on the order of $10,000 worth of wholesale compute time. (Carlos Outeiral, "CASP14: what Google DeepMind’s AlphaFold 2 really achieved, and what it means for protein folding, biology and bioinformatics", Oxford Protein Informatics Group, 3 December) But a black-box service would deny researchers access to the internal states of the system, the chance to learn more qualitatively what gives rise to AlphaFold 2's success, and the potential for new algorithms that could be lighter and more efficient yet still achieve such results. Fears of a potential lack of transparency by DeepMind have been contrasted with five decades of heavy public investment into the open Protein Data Bank and also into open DNA sequence repositories, without which the data to train AlphaFold 2 would not have existed. On June 18, 2021, Demis Hassabis tweeted: "Brief update on some exciting progress on #AlphaFold! We’ve been heads down working flat out on our full methods paper (currently under review) with accompanying open source code and on providing broad free access to AlphaFold for the scientific community. More very soon!" (Demis Hassabis, "Brief update on some exciting progress on #AlphaFold!" (tweet), via Twitter, 18 June 2021)
However, it is not yet clear to what extent structure predictions made by AlphaFold 2 will hold up for proteins bound into complexes with other proteins and other molecules. (Tom Ireland, "How will AlphaFold change bioscience research?", ''The Biologist'', 4 December 2020)
This was not part of the CASP competition that AlphaFold entered, and not an eventuality it was internally designed to expect. Where AlphaFold 2 did predict structures for proteins that had strong interactions, either with other copies of themselves or with other structures, its predictions tended to be least refined and least reliable. As a large fraction of the most important biological machines in a cell comprise such complexes, or relate to how protein structures become modified when in contact with other molecules, this is an area that will continue to be the focus of considerable experimental attention. With so little yet known about the internal patterns that AlphaFold 2 learns to make its predictions, it is not yet clear to what extent the program may be impaired in its ability to identify novel folds, if such folds are not well represented in the existing protein structures known in structure databases. (Stephen Curry, "No, DeepMind has not solved protein folding", Reciprocal Space (blog), 2 December 2020)
It is also not well known to what extent the protein structures in such databases, overwhelmingly of proteins that it has been possible to crystallise for X-ray analysis, are representative of typical proteins that have not yet been crystallised. It is also unclear how representative the frozen protein structures in crystals are of the dynamic structures found in cells ''in vivo''. AlphaFold 2's difficulties with structures obtained by protein NMR methods may not be a good sign.

On its potential as a tool for drug discovery, Stephen Curry notes that while the resolution of AlphaFold 2's structures may be very good, the accuracy with which binding sites are modelled needs to be even higher: typically molecular docking studies require the atomic positions to be accurate within a 0.3 Å margin, but the predicted protein structures have at best an RMSD of 0.9 Å for all atoms. So AlphaFold 2's structures may only be a limited help in such contexts. Moreover, according to ''Science'' columnist Derek Lowe, because the prediction of small-molecule binding is even then still not very good, computational prediction of drug targets is simply not in a position to take over as the "backbone" of corporate drug discovery, so "protein structure determination simply isn’t a rate-limiting step in drug discovery in general". It has also been noted that even with a structure for a protein, understanding how it functions, what it does, and how that fits within wider biological processes can still be very challenging. Nevertheless, if better knowledge of protein structure leads to better understanding of individual disease mechanisms and ultimately to better drug targets, or to better understanding of the differences between human and animal models, that could eventually lead to improvements.

Also, because AlphaFold processes protein-only sequences by design, other associated biomolecules are not considered. Regarding the absence from AlphaFold models of metals, co-factors and, most visibly, co- and post-translational modifications such as protein glycosylation, Elisa Fadda (Maynooth University, Ireland) and Jon Agirre (University of York, UK) highlighted the need for scientists to check databases such as UniProt-KB for likely missing components, as these can play an important role not just in folding but in protein function. However, the authors highlighted that many AlphaFold models were accurate enough to allow for the introduction of ''post-predictional'' modifications.

Finally, some have noted that even a perfect answer to the protein ''prediction'' problem would still leave questions about the protein ''folding'' problem: understanding in detail how the folding process actually occurs in nature (and how proteins can sometimes also misfold). But even with such caveats, AlphaFold 2 was described as a huge technical step forward and intellectual achievement.
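The gap between the 0.3 Å docking margin and the 0.9 Å all-atom RMSD quoted above can be made concrete with a small calculation. The following is a minimal sketch only: it assumes the two structures are already superimposed and matched atom for atom, and the function name and toy coordinates are hypothetical, not real protein data.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points.

    Assumes the structures are already superimposed; no alignment is performed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical toy coordinates in Å: every atom is displaced by 0.9 Å along y.
predicted = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
experimental = [(0.0, 0.9, 0.0), (1.5, 0.9, 0.0), (3.0, 0.9, 0.0)]

DOCKING_TOLERANCE = 0.3  # Å, the docking margin quoted in the text

value = rmsd(predicted, experimental)
print(f"RMSD = {value:.2f} Å; within docking tolerance: {value <= DOCKING_TOLERANCE}")
```

Because RMSD averages over all atoms, a model can have a good overall value while individual binding-site atoms deviate by more than the docking margin, which is why the per-site accuracy matters more than the global figure in this context.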


Protein Structure Database

The AlphaFold Protein Structure Database was launched on July 22, 2021 as a joint effort between DeepMind and EMBL-EBI. At launch the database contained AlphaFold-predicted models of protein structures of nearly the full UniProt proteome of humans and 20 model organisms, amounting to over 365,000 proteins. The database does not include proteins with fewer than 16 or more than 2700 amino acid residues, but for humans they are available in the whole batch file. The plan was to add more sequences to the collection, the initial goal (as of the beginning of 2022) being to cover most of the UniRef90 set of more than 100 million proteins. As of May 15, 2022, 992,316 predictions were available. In July 2021, UniProt-KB and InterPro were updated to show AlphaFold predictions when available. On July 28, 2022, the team uploaded to the database the structures of around 200 million proteins from 1 million species, covering nearly every known protein on the planet.
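The length cut-offs stated above amount to a simple inclusion rule. A minimal sketch, assuming sequences are plain one-letter amino acid strings; the function name and the toy sequences are hypothetical and not part of any AlphaFold or database tooling.

```python
MIN_RESIDUES = 16    # lower bound stated in the text
MAX_RESIDUES = 2700  # upper bound stated in the text


def in_length_range(sequence: str) -> bool:
    """Return True if the sequence length falls inside the stated bounds."""
    return MIN_RESIDUES <= len(sequence) <= MAX_RESIDUES


# Hypothetical toy sequences; only their lengths matter here.
sequences = {
    "short_peptide": "M" * 10,
    "typical_protein": "M" * 350,
    "very_long_protein": "M" * 3000,
}

eligible = [name for name, seq in sequences.items() if in_length_range(seq)]
print(eligible)
```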


Limitations

The AlphaFold DB uses a monomeric model similar to the CASP14 version. As a result, many of the same limitations are expected:
* The DB model only predicts monomers, missing some important context in the form of protein complexes. The AlphaFold Multimer model is published separately as open source, but pre-run models are not available.
* The model is unreliable for intrinsically disordered proteins, although it does convey the information via a low confidence score.
* The model is not validated for mutational analysis.
* The model can only output one conformation of proteins with multiple conformations, with no control of which.
* The model only predicts the main peptide chain, not the structures of missing co-factors, metals, and co- and post-translational modifications. This can be a large oversight for a number of biologically relevant systems: between 50% and 70% of the structures of the human proteome are incomplete without covalently attached glycans. On the other hand, since the model is trained from PDB models often with these modifications attached, the predicted structure is "frequently consistent with the expected structure in the presence of ions or cofactors".
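The low confidence score mentioned in the list above can be read directly from a downloaded model. The sketch below rests on assumptions not stated in the text: that the per-residue confidence (pLDDT) is stored in the B-factor column of the PDB-format file, as it is in AlphaFold DB downloads, and that a cutoff of 50 marks very low confidence. The helper names and the four-residue sample are hypothetical.

```python
def make_ca_line(serial, resname, resseq, score):
    """Build a fixed-column PDB ATOM record for a C-alpha atom (toy coordinates)."""
    return (
        f"ATOM  {serial:5d} {' CA ':<4s} {resname:3s} A{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{score:6.2f}"
    )


def residue_confidence(pdb_text):
    """Map residue number to the value in the B-factor column (columns 61-66)."""
    scores = {}
    for line in pdb_text.splitlines():
        # One C-alpha atom per residue is enough to read the per-residue score.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


# Hypothetical sample: two confident residues followed by two low-confidence ones.
sample = "\n".join(
    make_ca_line(i, "ALA", i, score)
    for i, score in enumerate([92.5, 88.1, 41.3, 35.0], start=1)
)

LOW_CONFIDENCE_CUTOFF = 50.0  # assumed threshold for "very low" confidence

scores = residue_confidence(sample)
low = [res for res, score in scores.items() if score < LOW_CONFIDENCE_CUTOFF]
print(low)
```

Long runs of low-scoring residues are a hint that a region may be disordered or poorly modelled, not proof of either, so flagged regions are best checked against other evidence.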


Applications


SARS-CoV-2

AlphaFold has been used to predict structures of proteins of SARS-CoV-2, the causative agent of COVID-19. The structures of these proteins were pending experimental determination in early 2020. Results were examined by scientists at the Francis Crick Institute in the United Kingdom before release into the larger research community. The team also confirmed accurate prediction against the experimentally determined SARS-CoV-2 spike protein that was shared in the Protein Data Bank, an international open-access database, before releasing the computationally determined structures of the under-studied protein molecules. The team acknowledged that although these protein structures might not be the subject of ongoing therapeutic research efforts, they will add to the community's understanding of the SARS-CoV-2 virus. Specifically, AlphaFold 2's prediction of the structure of the ''ORF3a'' protein was very similar to the structure determined by researchers at the University of California, Berkeley using cryo-electron microscopy. This specific protein is believed to assist the virus in breaking out of the host cell once it replicates. It is also believed to play a role in triggering the inflammatory response to the infection.


Published works

* Andrew W. Senior ''et al.'' (December 2019), "Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13)", ''Proteins: Structure, Function, Bioinformatics'' 87(12) 1141–1148
* Andrew W. Senior ''et al.'' (15 January 2020), "Improved protein structure prediction using potentials from deep learning", ''Nature'' 577 706–710
* John Jumper ''et al.'' (December 2020), "High Accuracy Protein Structure Prediction Using Deep Learning", in ''Fourteenth Critical Assessment of Techniques for Protein Structure Prediction (Abstract Book)'', pp. 22–24
* John Jumper ''et al.'' (December 2020), "AlphaFold 2". Presentation given at CASP 14.


See also

* Folding@home
* IBM Blue Gene
* Foldit
* Rosetta@home
* Human Proteome Folding Project
* AlphaZero
* AlphaGo


Further reading

* Carlos Outeiral, CASP14: what Google DeepMind’s AlphaFold 2 really achieved, and what it means for protein folding, biology and bioinformatics, Oxford Protein Informatics Group (3 December)
* Mohammed AlQuraishi, AlphaFold2 @ CASP14: "It feels like one’s child has left home." (blog), 8 December 2020
* Mohammed AlQuraishi, The AlphaFold2 Method Paper: A Fount of Good Ideas (blog), 25 July 2021


External links


AlphaFold 2

* Open access to protein structure predictions for the human proteome and 20 other key organisms at European Bioinformatics Institute
* CASP 14 website
* AlphaFold: The making of a scientific breakthrough, DeepMind, via YouTube
* ColabFold, version for homooligomeric prediction and complexes
* AlphaFold Protein Structure Database website