
Chargaff's rules (given by Erwin Chargaff) state that in the DNA of any species and any organism, the amount of guanine should be equal to the amount of cytosine and the amount of adenine should be equal to the amount of thymine. Further, a 1:1 stoichiometric ratio of purine and pyrimidine bases (i.e., A+G=T+C) should exist. This pattern is found in both strands of the DNA. The rules were discovered by Austrian-born chemist Erwin Chargaff in the late 1940s.
Definitions
First parity rule
The first rule holds that a double-stranded DNA molecule ''globally'' has percentage base pair equality: A% = T% and G% = C%. The rigorous validation of the rule constitutes the basis of Watson–Crick base pairs in the DNA double helix model.
Second parity rule
The second rule holds that both A% ≈ T% and G% ≈ C% are valid for each of the two DNA strands. This describes only a global feature of the base composition in a single DNA strand.
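The difference between the two rules can be illustrated with a minimal Python sketch. The function names and the toy strand are assumptions made for illustration only. In the duplex, A% = T% and G% = C% follow directly from base pairing, whereas nothing forces the same equality within a single strand, which is why the second rule is an empirical, approximate observation.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def base_percentages(strand: str) -> dict[str, float]:
    """Percentage of A, C, G and T in one strand (other symbols are ignored)."""
    counts = Counter(b for b in strand.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("strand contains no A, C, G or T")
    return {b: 100.0 * counts[b] / total for b in "ACGT"}


def reverse_complement(strand: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return strand.upper().translate(COMPLEMENT)[::-1]


def duplex_percentages(strand: str) -> dict[str, float]:
    """Percentages over both strands of the duplex built from `strand`."""
    return base_percentages(strand + reverse_complement(strand))


strand = "AAGGGC"  # hypothetical toy strand, far too short to be realistic
single = base_percentages(strand)
duplex = duplex_percentages(strand)

# First rule: exact in the duplex, because every A pairs with a T
# and every G pairs with a C.
print(duplex["A"] == duplex["T"], duplex["G"] == duplex["C"])

# Second rule: not guaranteed for an arbitrary single strand.
print(single["A"], single["T"])
```

On real genomes the single-strand percentages would be compared with a tolerance rather than for exact equality.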
Research
The second parity rule was discovered in 1968. It states that, in single-stranded DNA, the number of adenine units is ''approximately'' equal to that of thymine (%A ≈ %T), and the number of cytosine units is ''approximately'' equal to that of guanine (%C ≈ %G).
In 2006, it was shown that this rule applies to four of the five types of double stranded genomes; specifically it applies to the eukaryotic chromosomes, the bacterial chromosomes, the double stranded DNA viral genomes, and the archaeal chromosomes. It does not apply to organellar genomes (mitochondria and plastids) smaller than ~20–30 kbp, nor does it apply to single stranded DNA (viral) genomes or any type of RNA genome. The basis for this rule is still under investigation, although genome size may play a role.

The rule itself has consequences. In most bacterial genomes (which are generally 80–90% coding) genes are arranged in such a fashion that approximately 50% of the coding sequence lies on either strand.
Wacław Szybalski, in the 1960s, showed that in bacteriophage coding sequences purines (A and G) exceed pyrimidines (C and T). This rule has since been confirmed in other organisms and should probably now be termed "Szybalski's rule". While Szybalski's rule generally holds, exceptions are known to exist. The biological basis for Szybalski's rule is not yet known.
The combined effect of Chargaff's second rule and Szybalski's rule can be seen in bacterial genomes where the coding sequences are not equally distributed. The genetic code has 64 codons of which 3 function as termination codons: there are only 20 amino acids normally present in proteins. (There are two uncommon amino acids, selenocysteine and pyrrolysine, found in a limited number of proteins and encoded by the stop codons TGA and TAG respectively.) The mismatch between the number of codons and amino acids allows several codons to code for a single amino acid; such codons normally differ only at the third codon base position.
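The codon arithmetic above can be checked with a short Python sketch. It assumes the standard genetic code written in the usual TCAG ordering, with `*` marking termination codons; the variable names are illustrative. It counts the codons, the stop codons and the amino acids, and lists the two-base prefixes for which the third base never changes the encoded amino acid.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third base varies fastest);
# '*' marks a termination codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
amino_acids = {aa for aa in CODON_TABLE.values() if aa != "*"}

# Two-base prefixes whose four codons all encode the same amino acid,
# i.e. the third position is fully redundant.
fourfold_prefixes = [
    "".join(prefix)
    for prefix in product(BASES, repeat=2)
    if len({CODON_TABLE["".join(prefix) + b] for b in BASES}) == 1
]

print(len(CODON_TABLE), stop_codons, len(amino_acids))
print(fourfold_prefixes)
```

The remaining prefixes split their four codons between two amino acids, or between amino acids and stop codons, so the third position is where most of the redundancy sits.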
Multivariate statistical analysis of codon use within genomes with unequal quantities of coding sequences on the two strands has shown that codon use in the third position depends on the strand on which the gene is located. This seems likely to be the result of Szybalski's and Chargaff's rules. Because of the asymmetry in pyrimidine and purine use in coding sequences, the strand with the greater coding content will tend to have the greater number of purine bases (Szybalski's rule). Because the number of purine bases will, to a very good approximation, equal the number of their complementary pyrimidines within the same strand and because the coding sequences occupy 80–90% of the strand, there appears to be a selective pressure on the third base to minimize the number of purine bases in the strand with the greater coding content, and this pressure appears to be proportional to the mismatch in the length of the coding sequences between the two strands.

The origin of the deviation from Chargaff's rule in the organelles has been suggested to be a consequence of the mechanism of replication. During replication the DNA strands separate. In single stranded DNA, cytosine spontaneously and slowly deaminates to uracil, which pairs like thymine (a C to T transition). The longer the strands are separated the greater the quantity of deamination. For reasons that are not yet clear the strands tend to exist longer in single form in mitochondria than in chromosomal DNA. This process tends to yield one strand that is enriched in guanine (G) and thymine (T) with its complement enriched in cytosine (C) and adenine (A), and this process may have given rise to the deviations found in the mitochondria.
Chargaff's second rule appears to be the consequence of a more complex parity rule: within a single strand of DNA any oligonucleotide (k-mer or n-gram; length ≤ 10) is present in equal numbers to its reverse complementary oligonucleotide. Because of the computational requirements this has not been verified in all genomes for all oligonucleotides. It has been verified for triplet oligonucleotides for a large data set.
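A minimal Python sketch of how this oligonucleotide parity could be checked on one strand follows. The function names are illustrative assumptions, and the approach simply counts overlapping k-mers, which is practical only for small k or modest sequence lengths.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(strand: str, k: int) -> Counter:
    """Count overlapping k-mers on one strand, skipping windows with non-ACGT symbols."""
    strand = strand.upper()
    counts = Counter()
    for i in range(len(strand) - k + 1):
        kmer = strand[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def symmetry_ratios(strand: str, k: int) -> dict[str, float]:
    """Map each observed k-mer to count(k-mer) / count(reverse complement).

    The ratio is float('inf') when the reverse complement never occurs.
    """
    counts = kmer_counts(strand, k)
    ratios = {}
    for kmer, n in counts.items():
        m = counts[reverse_complement(kmer)]  # Counter returns 0 for missing keys
        ratios[kmer] = n / m if m else float("inf")
    return ratios


# Toy example; a real test would use a whole chromosome read from a file.
print(symmetry_ratios("ACGTACGT", 2))
```

Under the extended parity rule the ratios for a long genomic strand would be expected to lie close to 1; with k = 1 the same function reduces to the A/T and G/C comparison of the second rule.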
Albrecht-Buehler has suggested that this rule is the consequence of genomes evolving by a process of inversion and transposition. This process does not appear to have acted on the mitochondrial genomes. Chargaff's second parity rule appears to extend from the nucleotide level to populations of codon triplets, in the case of whole single-stranded human genome DNA.
A kind of "codon-level second Chargaff's parity rule" is proposed as follows:
{|
|+ Intra-strand relation among percentages of codon populations
! scope=col | First codon !! scope=col | Second codon !! scope=col | Relation proposed !! scope=col | Details
|-
| Twx (1st base position is T) || yzA (3rd base position is A) || %Twx ≈ %yzA || Twx and yzA are mirror codons, e.g. TCG and CGA
|-
| Cwx (1st base position is C) || yzG (3rd base position is G) || %Cwx ≈ %yzG || Cwx and yzG are mirror codons, e.g. CTA and TAG
|-
| wTx (2nd base position is T) || yAz (2nd base position is A) || %wTx ≈ %yAz || wTx and yAz are mirror codons, e.g. CTG and CAG
|-
| wCx (2nd base position is C) || yGz (2nd base position is G) || %wCx ≈ %yGz || wCx and yGz are mirror codons, e.g. TCT and AGA
|-
| wxT (3rd base position is T) || Ayz (1st base position is A) || %wxT ≈ %Ayz || wxT and Ayz are mirror codons, e.g. CTT and AAG
|-
| wxC (3rd base position is C) || Gyz (1st base position is G) || %wxC ≈ %Gyz || wxC and Gyz are mirror codons, e.g. GGC and GCC
|}
Examples: counting codons over the whole human genome in the first codon reading frame gives 36530115 TTT and 36381293 AAA (ratio = 1.00409), 2087242 TCG and 2085226 CGA (ratio = 1.00096), etc.
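A count of this kind can be sketched in Python as follows. This is not the procedure used to obtain the figures quoted above; the function names and the toy sequence are assumptions for illustration. Codons are read as non-overlapping triplets from a chosen frame offset, and each codon is compared with its mirror (reverse-complement) codon.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def mirror_codon(codon: str) -> str:
    """Reverse complement of a codon, e.g. TCG -> CGA."""
    return codon.translate(COMPLEMENT)[::-1]


def frame_codon_counts(strand: str, frame: int = 0) -> Counter:
    """Count non-overlapping triplets read from offset `frame` (0, 1 or 2)."""
    strand = strand.upper()
    codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
    # Triplets containing symbols other than A, C, G, T (e.g. N) are skipped.
    return Counter(c for c in codons if set(c) <= set("ACGT"))


def mirror_ratio(counts: Counter, codon: str) -> float:
    """count(codon) / count(mirror codon); raises if the mirror codon is absent."""
    mirror = counts[mirror_codon(codon)]
    if mirror == 0:
        raise ZeroDivisionError(f"mirror codon of {codon} was not observed")
    return counts[codon] / mirror


counts = frame_codon_counts("TCGCGATTTAAA")  # hypothetical toy sequence
print(counts)
print(mirror_ratio(counts, "TCG"))
```

Reading from a single frame matters here: shifting `frame` by one changes every triplet, so the three frames of a strand give three different codon populations.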
In 2020, it was suggested that the physical properties of the dsDNA (double stranded DNA) and the tendency of all physical systems toward maximum entropy are the cause of Chargaff's second parity rule. The symmetries and patterns present in the dsDNA sequences can emerge from the physical peculiarities of the dsDNA molecule and the maximum entropy principle alone, rather than from biological or environmental evolutionary pressure.
Percentages of bases in DNA
The following table is a representative sample of Erwin Chargaff's 1952 data, listing the base composition of DNA from various organisms; the data support both of Chargaff's rules.
A genome such as that of φX174, whose A/T and G/C ratios deviate significantly from one, is indicative of single stranded DNA.
{|
! scope=col | Organism !! scope=col | Taxon !! scope=col | %A !! scope=col | %G !! scope=col | %C !! scope=col | %T !! scope=col | A / T !! scope=col | G / C !! scope=col | %GC !! scope=col | %AT
|-
| Maize || ''Zea'' || 26.8 || 22.8 || 23.2 || 27.2 || 0.99 || 0.98 || 46.1 || 54.0
|-
| Octopus || ''Octopus'' || 33.2 || 17.6 || 17.6 || 31.6 || 1.05 || 1.00 || 35.2 || 64.8
|-
| Chicken || ''Gallus'' || 28.0 || 22.0 || 21.6 || 28.4 || 0.99 || 1.02 || 43.7 || 56.4
|-
| Rat || ''Rattus'' || 28.6 || 21.4 || 20.5 || 28.4 || 1.01 || 1.00 || 42.9 || 57.0
|-
| Human || ''Homo'' || 29.3 || 20.7 || 20.0 || 30.0 || 0.98 || 1.04 || 40.7 || 59.3
|-
| Grasshopper || Orthoptera || 29.3 || 20.5 || 20.7 || 29.3 || 1.00 || 0.99 || 41.2 || 58.6
|-
| Sea urchin || Echinoidea || 32.8 || 17.7 || 17.3 || 32.1 || 1.02 || 1.02 || 35.0 || 64.9
|-
| Wheat || ''Triticum'' || 27.3 || 22.7 || 22.8 || 27.1 || 1.01 || 1.00 || 45.5 || 54.4
|-
| Yeast || ''Saccharomyces'' || 31.3 || 18.7 || 17.1 || 32.9 || 0.95 || 1.09 || 35.8 || 64.4
|-
| ''E. coli'' || ''Escherichia'' || 24.7 || 26.0 || 25.7 || 23.6 || 1.05 || 1.01 || 51.7 || 48.3
|-
| φX174 || ''PhiX174'' || 24.0 || 23.3 || 21.5 || 31.2 || 0.77 || 1.08 || 44.8 || 55.2
|}
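The derived columns of the table (A / T, G / C, %GC and %AT) follow from the four base percentages, as the Python sketch below shows for two rows. The function names are illustrative, and the `tolerance` used to flag a possibly single-stranded genome is an arbitrary assumption, not a published threshold.

```python
def chargaff_ratios(a: float, g: float, c: float, t: float) -> dict[str, float]:
    """Derive the ratio and sum columns from the four base percentages."""
    return {
        "A/T": round(a / t, 2),
        "G/C": round(g / c, 2),
        "%GC": round(g + c, 1),
        "%AT": round(a + t, 1),
    }


def looks_single_stranded(ratios: dict[str, float], tolerance: float = 0.1) -> bool:
    """Flag a genome whose A/T or G/C ratio is far from 1.

    `tolerance` is an assumed cut-off chosen only for this illustration.
    """
    return abs(ratios["A/T"] - 1) > tolerance or abs(ratios["G/C"] - 1) > tolerance


# Percentages (%A, %G, %C, %T) copied from the octopus and φX174 rows.
octopus = chargaff_ratios(33.2, 17.6, 17.6, 31.6)
phix174 = chargaff_ratios(24.0, 23.3, 21.5, 31.2)

print(octopus, looks_single_stranded(octopus))
print(phix174, looks_single_stranded(phix174))
```

With these inputs the φX174 ratios stand well apart from 1 while the octopus ratios do not, which matches the reading of the table given above.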
See also
* Genetic codes
External links
CBS Genome Atlas Database — contains hundreds of examples of base skews and had problems.
The Z curve database of genomes: a 3-dimensional visualization and analysis tool of genomes. Zhang CT, Zhang R, Ou HY (2003). "The Z curve database: a graphic representation of genome sequences". Bioinformatics. 19 (5): 593–599. doi:10.1093/bioinformatics/btg041. PMID 12651717.