Protein primary structure is the linear sequence of amino acids in a peptide or protein. By convention, the primary structure of a protein is reported from the amino-terminal (N) end to the carboxyl-terminal (C) end.

Protein biosynthesis is most commonly performed by ribosomes in cells. Peptides can also be synthesized in the laboratory. Protein primary structures can be directly sequenced, or inferred from DNA sequences.
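As an illustration of inferring a primary structure from a DNA sequence, the following minimal sketch translates a coding strand codon by codon using the standard genetic code. The function name `translate` and the example sequence are hypothetical. The sketch assumes the reading frame starts at the first base, and it ignores alternative genetic codes and the recoding of stop codons that allows selenocysteine and pyrrolysine to be incorporated.

```python
# Standard genetic code, written compactly: codons are enumerated with each
# base position running over T, C, A, G (first base slowest, third fastest).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(coding_dna: str) -> str:
    """Translate a coding-strand DNA (or mRNA) sequence into one-letter
    amino acid codes, reading N-terminus to C-terminus.

    Assumes the reading frame starts at the first base. Translation stops
    at the first stop codon ('*'); trailing bases that do not fill a codon
    are ignored.
    """
    seq = coding_dna.upper().replace("U", "T")
    residues = []
    for start in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[start:start + 3]]
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return "".join(residues)


# Hypothetical example: ATG (Met), GCC (Ala), TGA (stop)
print(translate("ATGGCCTGA"))  # MA
```

The resulting string is already in the conventional N-to-C order, because the ribosome reads the message 5′ to 3′ and builds the chain starting at the N-terminus.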

Formation

Biological

Amino acids are polymerised via peptide bonds to form a long backbone, with the different amino acid side chains protruding along it. In biological systems, proteins are produced during translation by a cell's ribosomes. Some organisms can also make short peptides by non-ribosomal peptide synthesis, which often uses amino acids other than the encoded 22; the resulting peptides may be cyclised, modified and cross-linked.

Chemical

Peptides can be synthesised chemically via a range of laboratory methods. Chemical methods typically synthesise peptides in the opposite order (starting at the C-terminus) to biological protein synthesis (starting at the N-terminus).

Notation

Protein sequence is typically notated as a string of letters, listing the amino acids from the amino-terminal end through to the carboxyl-terminal end. Either a three-letter code or a single-letter code can be used to represent the 22 naturally encoded amino acids, as well as mixtures or ambiguous amino acids (similar to nucleic acid notation). Peptides can be directly sequenced, or inferred from DNA sequences. Large sequence databases now exist that collate known protein sequences.
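The two notations can be converted into each other mechanically. The sketch below is a minimal illustration, assuming the standard IUPAC codes for the 22 encoded amino acids plus three common ambiguity codes; the function names and the hyphen separator are choices made for this example, not a fixed convention.

```python
# Three-letter to one-letter codes for the 22 genetically encoded amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Sec": "U", "Pyl": "O",
    # Ambiguity codes: Asn-or-Asp, Gln-or-Glu, any/unknown residue.
    "Asx": "B", "Glx": "Z", "Xaa": "X",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_sequence: str, sep: str = "-") -> str:
    """Convert e.g. 'Met-Ala-Cys' (N to C) into 'MAC'."""
    return "".join(
        THREE_TO_ONE[code.capitalize()]
        for code in three_letter_sequence.split(sep)
    )


def to_three_letter(one_letter_sequence: str, sep: str = "-") -> str:
    """Convert e.g. 'MAC' (N to C) into 'Met-Ala-Cys'."""
    return sep.join(ONE_TO_THREE[code] for code in one_letter_sequence.upper())


print(to_one_letter("Met-Ala-Cys-Lys"))  # MACK
print(to_three_letter("MACK"))           # Met-Ala-Cys-Lys
```

Both functions preserve residue order, so the N-to-C reading direction of the input carries over to the output.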

Modification

In general, polypeptides are unbranched polymers, so their primary structure can often be specified by the sequence of amino acids along their backbone. However, proteins can become cross-linked, most commonly by disulfide bonds, and the primary structure then also requires specifying the cross-linking atoms, e.g., specifying the cysteines involved in the protein's disulfide bonds. Other crosslinks include desmosine.
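One way to picture a primary structure that includes cross-links is as a sequence string plus a list of residue pairs. The following is a minimal sketch under that assumption; the class name `PrimaryStructure`, the 1-based numbering from the N-terminus, and the example sequence are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PrimaryStructure:
    """A sequence (one-letter codes, N to C) plus its disulfide bonds.

    Each disulfide is a pair of 1-based residue positions; both positions
    must be cysteines, and a cysteine can take part in only one bond.
    """
    sequence: str
    disulfides: Tuple[Tuple[int, int], ...] = ()

    def __post_init__(self) -> None:
        used = set()
        for pair in self.disulfides:
            for position in pair:
                if not 1 <= position <= len(self.sequence):
                    raise ValueError(f"position {position} is out of range")
                if self.sequence[position - 1] != "C":
                    raise ValueError(f"residue {position} is not a cysteine")
                if position in used:
                    raise ValueError(f"cysteine {position} is bonded twice")
                used.add(position)


# Hypothetical peptide with one disulfide between Cys2 and Cys7.
peptide = PrimaryStructure("ACDEFGCK", disulfides=((2, 7),))
print(peptide.sequence, peptide.disulfides)
```

The validation step reflects the point made above: the sequence alone does not say which cysteines are paired, so the pairing has to be stated explicitly.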

Isomerisation

The chiral centers of a polypeptide chain can undergo racemization. Although this does not change the sequence, it does affect the chemical properties of the chain. In particular, the L-amino acids normally found in proteins can spontaneously isomerize at the Cα atom to form D-amino acids, which cannot be cleaved by most proteases. Additionally, proline can form stable cis-isomers at the peptide bond.

Post-translational modification

Additionally, the protein can undergo a variety of post-translational modifications, which are briefly summarized here.

The N-terminal amino group of a polypeptide can be modified covalently, e.g.,

* acetylation: The positive charge on the N-terminal amino group may be eliminated by changing it to an acetyl group (N-terminal blocking).
* formylation: The N-terminal methionine usually found after translation has an N-terminus blocked with a formyl group. This formyl group (and sometimes the methionine residue itself, if followed by Gly or Ser) is removed by the enzyme deformylase.
* pyroglutamate formation: An N-terminal glutamine can attack itself, forming a cyclic pyroglutamate group.
* myristoylation: Similar to acetylation. Instead of a simple methyl group, the myristoyl group has a tail of 14 hydrophobic carbons, which makes it ideal for anchoring proteins to cellular membranes.

The C-terminal carboxylate group of a polypeptide can also be modified, e.g.,

* amidation (see Figure): The C-terminus can also be blocked (thus neutralizing its negative charge) by amidation.
* glycosyl phosphatidylinositol (GPI) attachment: Glycosyl phosphatidylinositol (GPI) is a large, hydrophobic phospholipid prosthetic group that anchors proteins to cellular membranes. It is attached to the polypeptide C-terminus through an amide linkage that then connects to ethanolamine, thence to sundry sugars and finally to the phosphatidylinositol lipid moiety.

Finally, the peptide side chains can also be modified covalently, e.g.,

* phosphorylation: Aside from cleavage, phosphorylation is perhaps the most important chemical modification of proteins. A phosphate group can be attached to the sidechain hydroxyl group of serine, threonine and tyrosine residues, adding a negative charge at that site and producing an unnatural amino acid. Such reactions are catalyzed by kinases and the reverse reaction is catalyzed by phosphatases. The phosphorylated tyrosines are often used as "handles" by which proteins can bind to one another, whereas phosphorylation of Ser/Thr often induces conformational changes, presumably because of the introduced negative charge. The effects of phosphorylating Ser/Thr can sometimes be simulated by mutating the Ser/Thr residue to glutamate.
* glycosylation: A catch-all name for a set of very common and very heterogeneous chemical modifications. Sugar moieties can be attached to the sidechain hydroxyl groups of Ser/Thr or to the sidechain amide groups of Asn. Such attachments can serve many functions, ranging from increasing solubility to complex recognition. N-linked glycosylation (on Asn) can be blocked with certain inhibitors, such as tunicamycin.
* deamidation (succinimide formation): In this modification, an asparagine or aspartate side chain attacks the following peptide bond, forming a symmetrical succinimide intermediate. Hydrolysis of the intermediate produces either aspartate or the β-amino acid, iso(Asp). For asparagine, either product results in the loss of the amide group, hence "deamidation".
* hydroxylation: Proline residues may be hydroxylated at either of two atoms, as can lysine (at one atom). Hydroxyproline is a critical component of collagen, which becomes unstable upon its loss. The hydroxylation reaction is catalyzed by an enzyme that requires ascorbic acid (vitamin C), deficiencies in which lead to many connective-tissue diseases such as scurvy.
* methylation: Several protein residues can be methylated, most notably the positive groups of lysine and arginine. Arginine residues interact with the nucleic acid phosphate backbone and commonly form hydrogen bonds with the base residues, particularly guanine, in protein–DNA complexes. Lysine residues can be singly, doubly and even triply methylated. Methylation does ''not'' alter the positive charge on the side chain, however.
* acetylation: Acetylation of the lysine amino groups is chemically analogous to the acetylation of the N-terminus. Functionally, however, the acetylation of lysine residues is used to regulate the binding of proteins to nucleic acids. The cancellation of the positive charge on the lysine weakens the electrostatic attraction for the (negatively charged) nucleic acids.
* sulfation: Tyrosines may become sulfated on their Oη atom. Somewhat unusually, this modification occurs in the Golgi apparatus, not in the endoplasmic reticulum. Similar to phosphorylated tyrosines, sulfated tyrosines are used for specific recognition, e.g., in chemokine receptors on the cell surface. As with phosphorylation, sulfation adds a negative charge to a previously neutral site.
* prenylation and palmitoylation: The hydrophobic isoprene (e.g., farnesyl, geranyl, and geranylgeranyl groups) and palmitoyl groups may be added to the Sγ atom of cysteine residues to anchor proteins to cellular membranes. Unlike the GPI and myristoyl anchors, these groups are not necessarily added at the termini.
* carboxylation: A relatively rare modification that adds an extra carboxylate group (and, hence, a double negative charge) to a glutamate side chain, producing a Gla residue. This is used to strengthen the binding to "hard" metal ions such as calcium.
* ADP-ribosylation: The large ADP-ribosyl group can be transferred to several types of side chains within proteins, with heterogeneous effects. This modification is a target for the powerful toxins of disparate bacteria, e.g., ''Vibrio cholerae'', ''Corynebacterium diphtheriae'' and ''Bordetella pertussis''.
* ubiquitination and SUMOylation: Various full-length, folded proteins can be attached at their C-termini to the sidechain ammonium groups of lysines of other proteins. Ubiquitin is the most common of these, and usually signals that the ubiquitin-tagged protein should be degraded.

Most of the polypeptide modifications listed above occur ''post-translationally'', i.e., after the protein has been synthesized on the ribosome, typically occurring in the endoplasmic reticulum, a subcellular organelle of the eukaryotic cell. Many other chemical reactions (e.g., cyanylation) have been applied to proteins by chemists, although they are not found in biological systems.
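The remark that Ser/Thr phosphorylation can sometimes be mimicked by mutation to glutamate can be expressed at the sequence level. The sketch below is only an illustration: the function name `phosphomimetic`, the 1-based numbering, and the example sequence are hypothetical, and whether such a substitution actually reproduces the effect of phosphorylation has to be tested experimentally for each protein.

```python
def phosphomimetic(sequence: str, position: int) -> str:
    """Return a copy of `sequence` (one-letter codes, N to C) in which the
    Ser or Thr at the 1-based `position` is replaced by glutamate (E).

    Raises ValueError if the position is out of range or the residue there
    is not Ser (S) or Thr (T).
    """
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is out of range")
    residue = sequence[position - 1]
    if residue not in ("S", "T"):
        raise ValueError(
            f"residue {residue}{position} is not Ser or Thr"
        )
    return sequence[:position - 1] + "E" + sequence[position:]


# Hypothetical example: mimic phosphorylation of Ser3.
print(phosphomimetic("MASK", 3))  # MAEK
```

The substitution changes the encoded sequence itself, whereas phosphorylation leaves the genetic sequence untouched and modifies the side chain after translation.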

Cleavage and ligation

In addition to those listed above, the most important modification of primary structure is peptide cleavage (by chemical hydrolysis or by proteases). Proteins are often synthesized in an inactive precursor form; typically, an N-terminal or C-terminal segment blocks the active site of the protein, inhibiting its function. The protein is activated by cleaving off the inhibitory peptide.

Some proteins even have the power to cleave themselves. Typically, the hydroxyl group of a serine (rarely, threonine) or the thiol group of a cysteine residue will attack the carbonyl carbon of the preceding peptide bond, forming a tetrahedrally bonded intermediate, classified as a hydroxyoxazolidine (Ser/Thr) or hydroxythiazolidine (Cys) intermediate. This intermediate tends to revert to the amide form, expelling the attacking group, since the amide form is usually favored by free energy (presumably due to the strong resonance stabilization of the peptide group). However, additional molecular interactions may render the amide form less stable; the amino group is expelled instead, resulting in an ester (Ser/Thr) or thioester (Cys) bond in place of the peptide bond. This chemical reaction is called an N-O acyl shift.

The ester/thioester bond can be resolved in several ways:

* Simple hydrolysis will split the polypeptide chain, where the displaced amino group becomes the new N-terminus. This is seen in the maturation of glycosylasparaginase.
* A β-elimination reaction also splits the chain, but results in a pyruvoyl group at the new N-terminus. This pyruvoyl group may be used as a covalently attached catalytic cofactor in some enzymes, especially decarboxylases such as S-adenosylmethionine decarboxylase (SAMDC) that exploit the electron-withdrawing power of the pyruvoyl group.
* Intramolecular transesterification, resulting in a ''branched'' polypeptide. In inteins, the new ester bond is broken by an intramolecular attack by the soon-to-be C-terminal asparagine.
* Intermolecular transesterification can transfer a whole segment from one polypeptide to another, as is seen in the Hedgehog protein autoprocessing.

History

The proposal that proteins were linear chains of α-amino acids was made nearly simultaneously by two scientists at the same conference in 1902, the 74th meeting of the Society of German Scientists and Physicians, held in Karlsbad. Franz Hofmeister made the proposal in the morning, based on his observations of the biuret reaction in proteins. Hofmeister was followed a few hours later by Emil Fischer, who had amassed a wealth of chemical details supporting the peptide-bond model. For completeness, the proposal that proteins contained amide linkages was made as early as 1882 by the French chemist E. Grimaux.

Despite these data and later evidence that proteolytically digested proteins yielded only oligopeptides, the idea that proteins were linear, unbranched polymers of amino acids was not accepted immediately. Some scientists such as William Astbury doubted that covalent bonds were strong enough to hold such long molecules together; they feared that thermal agitations would shake such long molecules asunder. Hermann Staudinger faced similar prejudices in the 1920s when he argued that rubber was composed of macromolecules.

Thus, several alternative hypotheses arose. The colloidal protein hypothesis stated that proteins were colloidal assemblies of smaller molecules. This hypothesis was disproved in the 1920s by ultracentrifugation measurements by Theodor Svedberg that showed that proteins had a well-defined, reproducible molecular weight and by electrophoretic measurements by Arne Tiselius that indicated that proteins were single molecules. A second hypothesis, the cyclol hypothesis advanced by Dorothy Wrinch, proposed that the linear polypeptide underwent a chemical cyclol rearrangement, C=O + HN → C(OH)-N, that crosslinked its backbone amide groups, forming a two-dimensional ''fabric''. Other primary structures of proteins were proposed by various researchers, such as the diketopiperazine model of Emil Abderhalden and the pyrrol/piperidine model of Troensegaard in 1942. Although never given much credence, these alternative models were finally disproved when Frederick Sanger successfully sequenced insulin and by the crystallographic determination of myoglobin and hemoglobin by Max Perutz and John Kendrew.

Relation to secondary and tertiary structure

The primary structure of a biological polymer to a large extent determines the three-dimensional shape (tertiary structure). Protein sequence can be used to predict local features, such as segments of secondary structure, or trans-membrane regions. However, the complexity of protein folding makes it difficult to predict the tertiary structure of a protein from its sequence alone. Knowing the structure of a similar homologous sequence (for example a member of the same protein family) allows highly accurate prediction of the tertiary structure by homology modeling. If the full-length protein sequence is available, it is possible to estimate its general biophysical properties, such as its isoelectric point.
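As a rough illustration of estimating a biophysical property from sequence alone, the sketch below approximates the isoelectric point (the pH at which the net charge is zero) by treating every ionizable group as independent and applying the Henderson-Hasselbalch relation. The pKa values are assumed, illustrative numbers; published tables differ from one another, and real pKa values shift with the local environment in a folded protein. The function names are hypothetical.

```python
# Assumed, illustrative pKa values; published sets differ somewhat.
PKA_POSITIVE = {"N_term": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"C_term": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence: str, ph: float) -> float:
    """Approximate net charge of an unmodified chain at a given pH,
    treating every ionizable group as independent."""
    # Free termini: protonated amine (+) and deprotonated carboxylate (-).
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE["N_term"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE["C_term"] - ph))
    for residue in sequence.upper():
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tolerance: float = 1e-4) -> float:
    """Find the pH of zero net charge by bisection on [0, 14].

    Works because the net charge decreases monotonically with pH.
    """
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive: the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


# Hypothetical sequences: a lysine-rich and an aspartate-rich peptide.
print(round(isoelectric_point("KKKK"), 2))
print(round(isoelectric_point("DDDD"), 2))
```

Because the estimate uses only residue counts and the two termini, it ignores post-translational modifications such as the acetylation or phosphorylation described earlier, each of which would shift the result.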
