Threading (protein sequence)

In molecular biology, protein threading, also known as fold recognition, is a method of protein modeling used for proteins that have the same fold as proteins of known structure but have no homologous proteins with known structure. It differs from homology modeling in that threading is used for proteins with no homologous structures deposited in the Protein Data Bank (PDB), whereas homology modeling is used for proteins that do have them. Threading works by using statistical knowledge of the relationship between the structures deposited in the PDB and the sequence of the protein one wishes to model.

The prediction is made by "threading" (i.e. placing, aligning) each amino acid in the target sequence to a position in the template structure, and evaluating how well the target fits the template. After the best-fit template is selected, the structural model of the sequence is built from the alignment with the chosen template. Protein threading rests on two basic observations: that the number of different folds in nature is fairly small (approximately 1300); and that 90% of the new structures submitted to the PDB in the past three years have similar structural folds to ones already in the PDB.


Classification of protein structure

The Structural Classification of Proteins database (SCOP) provides a detailed and comprehensive description of the structural and evolutionary relationships of proteins of known structure. Proteins are classified to reflect both structural and evolutionary relatedness. The hierarchy has many levels, but the principal ones are family, superfamily, and fold:

* Family (clear evolutionary relationship): Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% or greater. In some cases, however, similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family even though some members have sequence identities of only 15%.
* Superfamily (probable common evolutionary origin): Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.
* Fold (major structural similarity): Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.
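The 30% identity guideline for families can be illustrated with a short calculation. The following is a minimal sketch, not SCOP's actual procedure: the function names are hypothetical, the two sequences are assumed to be already aligned with `-` as the gap character, and identity is computed over columns where both sequences have a residue (other denominator conventions exist).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned (non-gap) columns whose residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def meets_family_guideline(aligned_a: str, aligned_b: str, threshold: float = 30.0) -> bool:
    """Apply the rule-of-thumb identity cutoff (sequence evidence only)."""
    return percent_identity(aligned_a, aligned_b) >= threshold


print(percent_identity("ACDE-G", "ACQEFG"))  # 4 identical out of 5 compared columns
```

As the globin example shows, a pair falling below the cutoff can still belong to the same family when structure and function give evidence of common descent, so a check like this is only a first filter.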


Method

A general paradigm of protein threading consists of the following four steps:

# The construction of a structure template database: Select protein structures from the protein structure databases as structural templates. This generally involves selecting protein structures from databases such as the Protein Data Bank (PDB), the Families of Structurally Similar Proteins database (FSSP), the Structural Classification of Proteins database (SCOP), or the CATH database, after removing protein structures with high sequence similarities.
# The design of the scoring function: Design a good scoring function to measure the fitness between target sequences and templates, based on the known relationships between structures and sequences. A good scoring function should contain mutation potential, environment fitness potential, pairwise potential, secondary structure compatibilities, and gap penalties. The quality of the scoring (energy) function is closely related to the prediction accuracy, especially the alignment accuracy.
# Threading alignment: Align the target sequence with each of the structure templates by optimizing the designed scoring function. When the scoring function takes the pairwise contact potential into account, this step is one of the major computational tasks of a threading-based structure prediction program; when it does not, a dynamic programming algorithm can solve it.
# Threading prediction: Select the threading alignment that is statistically most probable as the threading prediction. Then construct a structure model for the target by placing the backbone atoms of the target sequence at their aligned backbone positions in the selected structural template.
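The four steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions, not the method of any particular program: the template "library" is two hand-written environment strings (`B` for buried, `E` for exposed), the scoring function has only a toy environment-fitness term and a linear gap penalty (no pairwise potential), and the best template is chosen by raw score rather than by a statistical significance estimate. All names (`Template`, `fitness`, `thread`, `best_template`) are hypothetical.

```python
from dataclasses import dataclass

HYDROPHOBIC = set("AVLIMFWC")  # assumed, simplified residue classes


@dataclass
class Template:
    name: str
    environment: str  # one letter per template position: 'B' buried, 'E' exposed


def fitness(residue: str, env: str) -> float:
    """Step 2 (toy): hydrophobic residues prefer buried sites, others exposed."""
    hydrophobic = residue in HYDROPHOBIC
    return 1.0 if hydrophobic == (env == "B") else -1.0


def thread(sequence: str, template: Template, gap_penalty: float = -2.0) -> float:
    """Step 3: global alignment score by dynamic programming (no pairwise term)."""
    n, m = len(sequence), len(template.environment)
    # dp[i][j] = best score aligning sequence[:i] to the first j template positions
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        dp[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + fitness(sequence[i - 1], template.environment[j - 1])
            dp[i][j] = max(match, dp[i - 1][j] + gap_penalty, dp[i][j - 1] + gap_penalty)
    return dp[n][m]


def best_template(sequence: str, library: list[Template]) -> tuple[str, float]:
    """Step 4 (simplified): pick the template with the highest raw score."""
    scored = [(t.name, thread(sequence, t)) for t in library]
    return max(scored, key=lambda item: item[1])


# Step 1 (toy): a two-entry template library.
library = [
    Template("core_first", "BBBBEEEE"),
    Template("surface_first", "EEEEBBBB"),
]
print(best_template("LLIVKDEK", library))  # ('core_first', 8.0)
```

Dynamic programming works here because each aligned position is scored independently of the others. Once a pairwise contact term is added, the score of one position depends on which residues occupy other positions, which is why step 3 becomes the hard part of the problem.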


Comparison with homology modeling

Homology modeling and protein threading are both template-based methods, and there is no rigorous boundary between them in terms of prediction techniques. What differs is the kind of target each addresses. Homology modeling is for targets that have homologous proteins with known structure (usually in the same family), while protein threading is for targets for which only fold-level homology is found. In other words, homology modeling is for "easier" targets and protein threading is for "harder" targets.

Homology modeling treats the template in an alignment as a sequence, and only sequence homology is used for prediction. Protein threading treats the template in an alignment as a structure, and both sequence and structure information extracted from the alignment are used for prediction. When no significant homology is found, protein threading can still make a prediction based on the structure information. This also explains why protein threading may be more effective than homology modeling in many cases.

In practice, when the sequence identity in a sequence-sequence alignment is low (i.e. <25%), homology modeling may not produce a meaningful prediction. In this case, if distant homology is found for the target, protein threading can generate a good prediction.


More about threading

Fold recognition methods can be broadly divided into two types: those that derive a 1-D profile for each structure in the fold library and align the target sequence to these profiles; and those that consider the full 3-D structure of the protein template. A simple example of a profile representation would be to take each amino acid in the structure and label it according to whether it is buried in the core of the protein or exposed on the surface. More elaborate profiles might take into account the local secondary structure (e.g. whether the amino acid is part of an alpha helix) or even evolutionary information (how conserved the amino acid is). In the 3-D representation, the structure is modeled as a set of inter-atomic distances, i.e. the distances are calculated between some or all of the atom pairs in the structure. This is a much richer and far more flexible description of the structure, but it is much harder to use in calculating an alignment.

The profile-based fold recognition approach was first described by Bowie, Lüthy and David Eisenberg in 1991. The term ''threading'' was coined by David Jones, William R. Taylor and Janet Thornton in 1992, and originally referred specifically to the use of a full 3-D atomic representation of the protein template in fold recognition. Today, the terms threading and fold recognition are frequently (though somewhat incorrectly) used interchangeably.

Fold recognition methods are widely used and effective because it is believed that there is a strictly limited number of different protein folds in nature, mostly as a result of evolution but also due to constraints imposed by the basic physics and chemistry of polypeptide chains. There is, therefore, a good chance (currently 70-80%) that a protein with a fold similar to that of the target protein has already been studied by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy and can be found in the PDB. Currently there are nearly 1300 different protein folds known, but new folds are still being discovered every year, due in significant part to the ongoing structural genomics projects.

Many different algorithms have been proposed for finding the correct threading of a sequence onto a structure, and many of them make use of dynamic programming in some form. For full 3-D threading, the problem of identifying the best alignment is very difficult (it is an NP-hard problem for some models of threading). Researchers have applied many combinatorial optimization methods, such as conditional random fields, simulated annealing, branch and bound, and linear programming, in the search for heuristic solutions. It is interesting to compare threading methods with methods that attempt to align two protein structures (protein structural alignment), and indeed many of the same algorithms have been applied to both problems.
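To make the 3-D representation concrete, the sketch below builds an inter-position distance matrix for a toy template, derives its contacts, and scores one threading with a pairwise term. It is an illustration under assumptions, not a published potential: the six coordinates are invented (one point per residue, laid out as a hairpin), the 8.0 distance cutoff and minimum chain separation of 3 are arbitrary choices, and the contact "potential" only rewards hydrophobic pairs. All function names are hypothetical.

```python
import math
from itertools import combinations

Coord = tuple[float, float, float]
HYDROPHOBIC = set("AVLIMFWC")  # assumed, simplified residue classes


def distance_matrix(coords: list[Coord]) -> list[list[float]]:
    """All inter-position distances of the template (the 3-D representation)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contacts(dist: list[list[float]], cutoff: float = 8.0,
             min_separation: int = 3) -> list[tuple[int, int]]:
    """Template position pairs closer than `cutoff` and not near in the chain."""
    n = len(dist)
    return [(i, j) for i, j in combinations(range(n), 2)
            if j - i >= min_separation and dist[i][j] < cutoff]


def pair_energy(a: str, b: str) -> float:
    """Toy contact potential: reward hydrophobic-hydrophobic contacts."""
    return -1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0


def pairwise_score(sequence: str, placement: dict[int, int],
                   contact_pairs: list[tuple[int, int]]) -> float:
    """Energy of one threading; `placement` maps template position -> sequence index."""
    total = 0.0
    for i, j in contact_pairs:
        if i in placement and j in placement:
            total += pair_energy(sequence[placement[i]], sequence[placement[j]])
    return total


# Invented hairpin: positions 0-2 run one way, positions 3-5 run back alongside.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0)]
pairs = contacts(distance_matrix(coords))
gapless = {pos: pos for pos in range(len(coords))}  # residue k sits at position k

print(pairs)                                     # [(0, 4), (0, 5), (1, 4), (1, 5)]
print(pairwise_score("LLDKVV", gapless, pairs))  # -4.0
print(pairwise_score("LKDKEV", gapless, pairs))  # -1.0
```

Changing the residue placed at one position changes the energy contributed by every position it contacts. This coupling is what breaks the position-by-position independence that dynamic programming relies on, and it is the reason full 3-D threading needs the heuristic or combinatorial methods listed above.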


Protein threading software

* HHpred is a popular threading server which runs HHsearch, a widely used program for remote homology detection based on pairwise comparison of hidden Markov models.
* RAPTOR is a protein threading program based on integer programming. It has been replaced by a new protein threading program, RaptorX, which applies probabilistic graphical models and statistical inference to both single-template and multi-template protein threading. RaptorX significantly outperforms RAPTOR and is especially good at aligning proteins with sparse sequence profiles. The RaptorX server is free to the public.
* Phyre is a popular threading server combining HHsearch with ''ab initio'' and multiple-template modelling.
* MUSTER is a standard threading algorithm based on dynamic programming and sequence profile-profile alignment. It also combines multiple structural resources to assist the sequence profile alignment.
* SPARKS X performs probabilistic sequence-to-structure matching between the predicted one-dimensional structural properties of the query and the corresponding native properties of the templates.
* BioShell is a threading algorithm that uses an optimized profile-to-profile dynamic programming algorithm combined with predicted secondary structure.


See also

* Homology modeling
* Protein structure prediction
* Protein structure prediction software

