A protein superfamily is the largest grouping (clade) of proteins for which common ancestry can be inferred (see homology). Usually this common ancestry is inferred from structural alignment and mechanistic similarity, even if no sequence similarity is evident. Sequence homology can then be deduced even if not apparent (due to low sequence similarity). Superfamilies typically contain several protein families which show sequence similarity within each family. The term "protein clan" is commonly used for protease and glycosyl hydrolase superfamilies based on the MEROPS and CAZy classification systems.

Identification

Superfamilies of proteins are identified using a number of methods. Closely related members can be identified by different methods to those needed to group the most evolutionarily divergent members.

Sequence similarity

Historically, the similarity of different amino acid sequences has been the most common method of inferring homology. Sequence similarity is considered a good predictor of relatedness, since similar sequences are more likely the result of gene duplication and divergent evolution, rather than the result of convergent evolution. Amino acid sequence is typically more conserved than DNA sequence (due to the degenerate genetic code), so it is a more sensitive detection method. Since some of the amino acids have similar properties (e.g., charge, hydrophobicity, size), conservative mutations that interchange them are often neutral to function. The most conserved sequence regions of a protein often correspond to functionally important regions like catalytic sites and binding sites, since these regions are less tolerant to sequence changes.

Using sequence similarity to infer homology has several limitations. There is no minimum level of sequence similarity guaranteed to produce identical structures. Over long periods of evolution, related proteins may show no detectable sequence similarity to one another. Sequences with many insertions and deletions can also be difficult to align, which makes it hard to identify the homologous sequence regions. In the PA clan of proteases, for example, not a single residue is conserved through the superfamily, not even those in the catalytic triad. Conversely, the individual families that make up a superfamily are defined on the basis of their sequence alignment, for example the C04 protease family within the PA clan.

Nevertheless, sequence similarity is the most commonly used form of evidence to infer relatedness, since the number of known sequences vastly outnumbers the number of known tertiary structures. In the absence of structural information, sequence similarity constrains the limits of which proteins can be assigned to a superfamily.
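The distinction between identical and conservatively substituted positions can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned (gaps written as `-`), and it uses a hand-written grouping of amino acids by broad physicochemical class. That grouping and the function names are assumptions made for illustration; real tools generally score substitutions with matrices such as BLOSUM or PAM, and several conventions exist for the denominator of percent identity.

```python
# Illustrative grouping of amino acids by broad physicochemical class.
# This table is an assumption for the sketch, not a standard scoring scheme.
CONSERVATIVE_GROUPS = [
    set("DE"),     # acidic
    set("KRH"),    # basic
    set("ST"),     # small, hydroxyl-containing
    set("NQ"),     # amide
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FYW"),    # aromatic
]


def is_conservative(a, b):
    """True if two different residues fall in the same illustrative group."""
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def alignment_scores(seq_a, seq_b, gap="-"):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Columns containing a gap in either sequence are ignored, so both
    percentages are taken over the ungapped aligned columns only.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = identical = similar = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        aligned += 1
        if a == b:
            identical += 1
            similar += 1
        elif is_conservative(a, b):
            similar += 1
    if aligned == 0:
        return 0.0, 0.0
    return 100 * identical / aligned, 100 * similar / aligned
```

Under this scheme a pair of sequences can score well on similarity while scoring poorly on identity, which is one reason amino acid comparisons remain informative after considerable divergence.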

Structural similarity

Structure is much more evolutionarily conserved than sequence, such that proteins with highly similar structures can have entirely different sequences. Over very long evolutionary timescales, very few residues show detectable amino acid sequence conservation; however, secondary structural elements and tertiary structural motifs are highly conserved. Some protein dynamics and conformational changes of the protein structure may also be conserved, as is seen in the serpin superfamily. Consequently, protein tertiary structure can be used to detect homology between proteins even when no evidence of relatedness remains in their sequences. Structural alignment programs, such as DALI, use the 3D structure of a protein of interest to find proteins with similar folds. However, on rare occasions, related proteins may evolve to be structurally dissimilar and relatedness can only be inferred by other methods.
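One way to see how structures can be compared without reference to sequence is to compare their internal distance matrices, which do not change when a structure is rotated or translated. The sketch below is a toy illustration of that idea only, not the algorithm used by DALI or any other program. It assumes the residue equivalences are already known, so that the two coordinate lists have the same length and order, and the function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between points (e.g. C-alpha atoms)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def distance_matrix_difference(coords_a, coords_b):
    """Mean absolute difference between corresponding internal distances.

    Assumes residue i of structure A is equivalent to residue i of
    structure B. A value near zero means the two sets of points have the
    same internal geometry, whatever their orientation in space.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")
    n = len(coords_a)
    if n < 2:
        raise ValueError("at least two residues are required")
    da = distance_matrix(coords_a)
    db = distance_matrix(coords_b)
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(da[i][j] - db[i][j])
            pairs += 1
    return total / pairs
```

Because only internal distances are used, no superposition step is needed before comparing the two structures. Finding the residue equivalences in the first place is the hard part of structural alignment and is not attempted here.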

Mechanistic similarity

The catalytic mechanism of enzymes within a superfamily is commonly conserved, although substrate specificity may be significantly different. Catalytic residues also tend to occur in the same order in the protein sequence. For the families within the PA clan of proteases, although there has been divergent evolution of the residues used to perform catalysis, all members use a similar mechanism to perform covalent, nucleophilic catalysis on proteins, peptides or amino acids. However, mechanism alone is not sufficient to infer relatedness. Some catalytic mechanisms have evolved convergently multiple times, and so occur in separate superfamilies, while some superfamilies display a range of different (though often chemically similar) mechanisms.
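The observation that catalytic residues tend to occur in the same order along the sequence can be expressed as a simple check. The sketch below is illustrative only: the enzyme names and residue positions are made-up placeholders, not data from any database, and the function names are hypothetical.

```python
def catalytic_order(residue_positions):
    """Return catalytic residue names sorted from N- to C-terminus.

    `residue_positions` maps a residue name (e.g. "His") to its position
    in the protein sequence.
    """
    ordered = sorted(residue_positions.items(), key=lambda item: item[1])
    return tuple(name for name, _ in ordered)


def same_catalytic_order(enzyme_a, enzyme_b):
    """True if two enzymes list their catalytic residues in the same order."""
    return catalytic_order(enzyme_a) == catalytic_order(enzyme_b)


# Hypothetical enzymes with made-up positions, for illustration only.
enzyme_1 = {"His": 57, "Asp": 102, "Ser": 195}
enzyme_2 = {"His": 40, "Asp": 85, "Ser": 160}
enzyme_3 = {"Asp": 32, "His": 64, "Ser": 221}

print(same_catalytic_order(enzyme_1, enzyme_2))
print(same_catalytic_order(enzyme_1, enzyme_3))
```

A shared residue order is consistent with common ancestry, whereas the same set of residues arranged in a different order is what would be expected if the mechanism had arisen independently. As with mechanism generally, such a check is supporting evidence rather than proof of relatedness.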

Evolutionary significance

Protein superfamilies represent the current limits of our ability to identify common ancestry. They are the largest evolutionary grouping based on direct evidence that is currently possible. They are therefore amongst the most ancient evolutionary events currently studied. Some superfamilies have members present in all kingdoms of life, indicating that the last common ancestor of that superfamily was in the last universal common ancestor of all life (LUCA). Superfamily members may be in different species, with the ancestral protein being the form of the protein that existed in the ancestral species (orthology). Conversely, the proteins may be in the same species, but evolved from a single protein whose gene was duplicated in the genome (paralogy).

Diversification

A majority of proteins contain multiple domains. Between 66% and 80% of eukaryotic proteins have multiple domains, while about 40% to 60% of prokaryotic proteins do. Over time, many of the superfamilies of domains have mixed together. In fact, it is very rare to find “consistently isolated superfamilies”. When domains do combine, the N- to C-terminal domain order (the "domain architecture") is typically well conserved. Additionally, the number of domain combinations seen in nature is small compared to the number of possibilities, suggesting that selection acts on all combinations.
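The two quantities discussed above, the fraction of multidomain proteins and the gap between observed and possible domain combinations, can be computed from a list of domain architectures. The following sketch assumes each protein is represented as a tuple of domain names in N- to C-terminal order. The domain names `A` to `D` and the function names are hypothetical, and only adjacent ordered pairs are counted as "combinations", which is one simplification among several possible.

```python
def multidomain_fraction(architectures):
    """Fraction of proteins whose architecture has more than one domain."""
    if not architectures:
        raise ValueError("at least one architecture is required")
    multi = sum(1 for arch in architectures if len(arch) > 1)
    return multi / len(architectures)


def adjacent_pair_usage(architectures):
    """Return (observed, possible) counts of ordered adjacent domain pairs.

    `possible` is the number of ordered pairs that could be formed from
    the distinct domains present, including a domain followed by itself.
    """
    domains = {domain for arch in architectures for domain in arch}
    observed = {
        (arch[i], arch[i + 1])
        for arch in architectures
        for i in range(len(arch) - 1)
    }
    return len(observed), len(domains) ** 2


# Hypothetical architectures, written N- to C-terminal.
example = [("A",), ("A", "B"), ("A", "B", "C"), ("D",), ("A", "B")]
print(multidomain_fraction(example))
print(adjacent_pair_usage(example))
```

Counting ordered pairs rather than unordered sets reflects the point that domain order is typically conserved: `("A", "B")` and `("B", "A")` are treated as different architectures.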

Examples

; α/β hydrolase superfamily: Members share an α/β sheet, containing 8 strands connected by helices, with catalytic residues in the same order. Activities include proteases, lipases, peroxidases, esterases, epoxide hydrolases and dehalogenases.

; Alkaline phosphatase superfamily: Members share an αβα sandwich structure as well as performing common promiscuous reactions by a common mechanism.

; Globin superfamily: Members share an 8-alpha helix globular globin fold.

; Immunoglobulin superfamily: Members share a sandwich-like structure of two sheets of antiparallel β strands (Ig-fold), and are involved in recognition, binding, and adhesion.

; PA clan: Members share a chymotrypsin-like double β-barrel fold and similar proteolysis mechanisms but sequence identity of <10%. The clan contains both cysteine and serine proteases (different nucleophiles).

; Ras superfamily: Members share a common catalytic G domain of a 6-strand β sheet surrounded by 5 α-helices.

; RSH superfamily: Members share capability to hydrolyze and/or synthesize ppGpp alarmones in the stringent response.

; Serpin superfamily: Members share a high-energy, stressed fold which can undergo a large conformational change, which is typically used to inhibit serine and cysteine proteases by disrupting their structure.

; TIM barrel superfamily: Members share a large α₈β₈ barrel structure. It is one of the most common protein folds and the monophyly of this superfamily is still contested.

Protein superfamily resources

Several biological databases document protein superfamilies and protein folds, for example:

* Pfam - Protein families database of alignments and HMMs
* PROSITE - Database of protein domains, families and functional sites
* PIRSF - SuperFamily Classification System
* PASS2 - Protein Alignment as Structural Superfamilies v2
* SUPERFAMILY - Library of HMMs representing superfamilies and database of (superfamily and family) annotations for all completely sequenced organisms
* SCOP and CATH - Classifications of protein structures into superfamilies, families and domains

Similarly, there are algorithms that search the PDB for proteins with structural homology to a target structure, for example:

* DALI - Structural alignment based on a distance alignment matrix method
