The Kozak consensus sequence (Kozak consensus or Kozak sequence) is a nucleic acid motif that functions as the protein translation initiation site in most eukaryotic mRNA transcripts.
Regarded as the optimum sequence for initiating translation in eukaryotes, the sequence is an integral aspect of protein regulation and overall cellular health, and has implications in human disease.
It ensures that a protein is correctly translated from the genetic message by mediating ribosome assembly and translation initiation. Initiation at a wrong start site can result in non-functional proteins. As the sequence has been studied further, extensions of the nucleotide sequence, additional bases of importance, and notable exceptions have been described.
The sequence was named after the scientist who discovered it, Marilyn Kozak. Kozak discovered the sequence through a detailed analysis of DNA genomic sequences.
The Kozak sequence is not to be confused with the ribosomal binding site (RBS), that being either the 5′ cap of a messenger RNA or an internal ribosome entry site (IRES).
Sequence
The Kozak sequence was determined by sequencing of 699 vertebrate mRNAs and verified by site-directed mutagenesis. While initially limited to a subset of vertebrates (''i.e.'' human, cow, cat, dog, chicken, guinea pig, hamster, mouse, pig, rabbit, sheep, and ''Xenopus''), subsequent studies confirmed its conservation in higher eukaryotes generally.
The sequence was defined as 5'-(gcc)gccRccAUGG-3' (IUPAC nucleobase notation summarized here) where:
# The underlined nucleotides (the AUG) indicate the translation start codon, coding for methionine.
# upper-case letters indicate highly conserved bases, ''i.e.'' the 'AUGG' sequence is constant or rarely, if ever, changes.
# 'R' indicates that a purine (adenine or guanine) is always observed at this position, with adenine being more frequent according to Kozak (notation per ''Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences'', NC-IUB, 1984).
# a lower-case letter denotes the most common base at a position where the base can nevertheless vary.
# the sequence in parentheses (gcc) is of uncertain significance.
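The notation rules listed above can be made concrete with a short sketch. It is only an illustration: the function names `parse_consensus` and `matches_conserved` are hypothetical, the RNA alphabet (A, C, G, U) is assumed, and only the upper-case (highly conserved) positions are treated as required.

```python
# Illustrative sketch; names are hypothetical and only the notation
# rules (upper case, lower case, 'R', parentheses) come from the text.
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "R": "AG"}  # R = purine

def parse_consensus(notation):
    """Return one (symbol, allowed_bases, category) tuple per position."""
    parsed = []
    in_parens = False
    for ch in notation:
        if ch == "(":
            in_parens = True
        elif ch == ")":
            in_parens = False
        else:
            allowed = IUPAC[ch.upper()]
            if in_parens:
                category = "uncertain"   # e.g. the leading (gcc)
            elif ch.isupper():
                category = "conserved"   # e.g. R, A, U, G, G
            else:
                category = "common"      # most common base, may vary
            parsed.append((ch, allowed, category))
    return parsed

def matches_conserved(parsed, site):
    """True if `site` agrees with every highly conserved position."""
    if len(site) != len(parsed):
        raise ValueError("site must have one base per consensus position")
    return all(
        base in allowed
        for base, (_, allowed, category) in zip(site.upper(), parsed)
        if category == "conserved"
    )

KOZAK = parse_consensus("(gcc)gccRccAUGG")
```

With this reading, a site such as `GCCGCCACCAUGG` satisfies the conserved positions, while replacing the purine at the 'R' position with U, or the final G with another base, does not.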
The AUG is the initiation codon encoding a methionine amino acid at the N-terminus of the protein. (Rarely, GUG is used as an initiation codon, but methionine is still the first amino acid, as it is the met-tRNA in the initiation complex that binds to the mRNA.) Variation within the Kozak sequence alters its "strength". Kozak sequence strength refers to the favorability of initiation, which affects how much protein is synthesized from a given mRNA.
The A nucleotide of the "AUG" is designated +1 in mRNA sequences, with the preceding base labeled −1; there is no position 0. For a 'strong' consensus, the nucleotides at positions +4 (G in the consensus) and −3 (either A or G in the consensus) relative to the +1 nucleotide must both match the consensus. An 'adequate' consensus matches at only one of these two positions, while a 'weak' consensus matches at neither. The cc at −1 and −2 are not as conserved, but contribute to the overall strength.
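The numbering convention and the strong/adequate/weak distinction can be sketched in a few lines. This is a minimal illustration, assuming an RNA string and a known 0-based index for the A of the AUG; the function names are hypothetical, and the contribution of the −1 and −2 positions is deliberately ignored.

```python
def kozak_position(index, aug_index):
    """Map a 0-based string index to +1/-1 numbering (there is no 0)."""
    offset = index - aug_index
    return offset + 1 if offset >= 0 else offset

def classify_kozak(mrna, aug_index):
    """Classify the context of the AUG that starts at `aug_index`."""
    mrna = mrna.upper()
    if mrna[aug_index:aug_index + 3] != "AUG":
        raise ValueError("no AUG at the given index")
    # Position -3 is three bases upstream of the A; +4 follows the AUG.
    minus3 = mrna[aug_index - 3] if aug_index >= 3 else ""
    plus4 = mrna[aug_index + 3] if aug_index + 3 < len(mrna) else ""
    hits = (minus3 in ("A", "G")) + (plus4 == "G")
    return ("weak", "adequate", "strong")[hits]
```

For example, `classify_kozak("GCCACCAUGG", 6)` sees an A at −3 and a G at +4, so both key positions match the consensus; changing either one alone leaves a single match.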
There is also evidence that a G in the −6 position is important in the initiation of translation.
While the +4 and −3 positions in the Kozak sequence have the greatest relative importance in establishing a favorable initiation context, a CC or AA motif at −2 and −1 was found to be important in the initiation of translation in tobacco and maize plants. Protein synthesis in yeast was found to be highly affected by the composition of the Kozak sequence, with adenine enrichment resulting in higher levels of gene expression. A suboptimal Kozak sequence can allow the pre-initiation complex (PIC) to scan past the first AUG site and initiate at a downstream AUG codon.
Ribosome assembly
The ribosome assembles on the start codon (AUG), located within the Kozak sequence. Prior to translation initiation, scanning is done by the pre-initiation complex (PIC). The PIC consists of the 40S (small) ribosomal subunit bound to the ternary complex (TC), eIF2-GTP-initiator Met-tRNA, to form the 43S complex. Assisted by several other initiation factors (eIF1 and eIF1A, eIF5, eIF3, polyA binding protein), it is recruited to the 5′ end of the mRNA. Eukaryotic mRNA is capped with a 7-methylguanosine (m7G) nucleotide, which can help recruit the PIC to the mRNA and initiate scanning. This recruitment to the m7G 5′ cap is supported by the inability of eukaryotic ribosomes to translate circular mRNA, which has no 5′ end. Once the PIC binds to the mRNA, it scans until it reaches the first AUG codon in a Kozak sequence. This scanning is referred to as the scanning mechanism of initiation.

The scanning mechanism of initiation starts when the PIC binds the 5′ end of the mRNA. Scanning is stimulated by Dhx29, Ddx3/Ded1, and eIF4 proteins. Dhx29 and Ddx3/Ded1 are DEAD-box helicases that help to unwind any mRNA secondary structure which could hinder scanning. The scanning of an mRNA continues until the first AUG codon on the mRNA is reached; this is known as the "First AUG Rule". While exceptions to the "First AUG Rule" exist, most exceptions take place at a second AUG codon that is located 3 to 5 nucleotides downstream from the first AUG, or within 10 nucleotides from the 5′ end of the mRNA. At the AUG codon, the anticodon of the methionine tRNA base-pairs with the mRNA codon. Upon base pairing to the start codon, the eIF5 in the PIC helps to hydrolyze a guanosine triphosphate (GTP) bound to the eIF2. This leads to a structural rearrangement that commits the PIC to binding to the large ribosomal subunit (60S) and forming the ribosomal complex (80S). Once the 80S ribosome complex is formed, the elongation phase of translation starts.
The start codon closest to the 5′ end of the strand is not always recognized if it is not contained in a Kozak-like sequence. ''Lmx1b'' is an example of a gene with a weak Kozak consensus sequence. For initiation of translation from such a site, other features are required in the mRNA sequence in order for the ribosome to recognize the initiation codon. Exceptions to the first AUG rule may occur if the first AUG is not contained in a Kozak-like sequence. This is called leaky scanning and could be a potential way to control translation through initiation.
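The interplay between the first AUG rule and leaky scanning can be illustrated with a toy model. The sketch below is a deliberate simplification: it is deterministic (an AUG in a context below the threshold is always skipped), whereas real leaky scanning is a matter of degree, and it judges context only by the −3 and +4 positions. The function names and the `min_strength` parameter are hypothetical.

```python
def context_strength(mrna, i):
    """Count key positions (-3 purine, +4 G) matching around the AUG at i."""
    minus3 = mrna[i - 3] if i >= 3 else ""
    plus4 = mrna[i + 3] if i + 3 < len(mrna) else ""
    return (minus3 in ("A", "G")) + (plus4 == "G")

def scan_for_start(mrna, min_strength=1):
    """Scan 5' to 3' and return the index of the first acceptable AUG.

    Toy model: AUGs whose context scores below `min_strength` are always
    bypassed. Returns None if no AUG qualifies.
    """
    mrna = mrna.upper()
    i = mrna.find("AUG")
    while i != -1:
        if context_strength(mrna, i) >= min_strength:
            return i
        i = mrna.find("AUG", i + 1)
    return None
```

In the made-up transcript `CCUUUUAUGCCCGCCACCAUGGCC`, the first AUG (index 6) has neither key position and is bypassed, so the sketch selects the second AUG (index 18). Setting `min_strength=0` reproduces a strict first AUG rule.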
It is believed that the PIC is stalled at the Kozak sequence by interactions between eIF2 and the −3 and +4 nucleotides in the Kozak sequence. This stalling allows the start codon and the corresponding anticodon time to form the correct hydrogen bonding. The Kozak consensus sequence is so common that the similarity of the sequence around the AUG codon to the Kozak sequence is used as a criterion for finding start codons in eukaryotes.
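One simple way to use similarity to the consensus as a criterion is to count matching flanking positions for every candidate AUG. The sketch below is an assumption-laden illustration rather than an established gene-finding method: it weights all positions from −6 to +4 equally, uses the `gccRccAUGG` form of the consensus, and its function names are hypothetical.

```python
CONSENSUS = "GCCRCCAUGG"  # positions -6..+4; R = A or G
AUG_OFFSET = 6            # index of the A of AUG within CONSENSUS

def base_matches(symbol, base):
    """Compare one base against one consensus symbol."""
    return base in "AG" if symbol == "R" else base == symbol

def kozak_similarity(mrna, aug_index):
    """Count flanking positions (AUG excluded) that match the consensus."""
    score = 0
    for k, symbol in enumerate(CONSENSUS):
        if AUG_OFFSET <= k < AUG_OFFSET + 3:
            continue  # the AUG itself is shared by every candidate
        j = aug_index - AUG_OFFSET + k
        if 0 <= j < len(mrna) and base_matches(symbol, mrna[j]):
            score += 1
    return score

def rank_start_codons(mrna):
    """Return (index, score) for every AUG, best score first."""
    mrna = mrna.upper()
    hits = []
    i = mrna.find("AUG")
    while i != -1:
        hits.append((i, kozak_similarity(mrna, i)))
        i = mrna.find("AUG", i + 1)
    return sorted(hits, key=lambda h: (-h[1], h[0]))
```

The maximum score is 7 (six upstream positions plus +4). An equal-weight count is the simplest choice; since the text singles out −3 and +4 as most important, a weighted score would be a natural refinement.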
Differences from bacterial initiation
The scanning mechanism of initiation, which utilizes the Kozak sequence, is found only in eukaryotes and has significant differences from the way bacteria initiate translation. The biggest difference is the existence of the Shine–Dalgarno (SD) sequence in bacterial mRNA. The SD sequence is located near the start codon, in contrast to the Kozak sequence, which actually contains the start codon. The Shine–Dalgarno sequence allows the 16S rRNA of the small ribosomal subunit to bind to the AUG (or alternative) start codon immediately. In contrast, scanning along the mRNA results in a more rigorous selection process for the AUG codon than in bacteria. An example of bacterial start codon promiscuity can be seen in the use of the alternate start codons UUG and GUG for some genes.
Archaeal transcripts use a mix of SD sequence, Kozak sequence, and leaderless initiation. Haloarchaea are known to have a variant of the Kozak consensus sequence in their Hsp70 genes.
Mutations and disease
Marilyn Kozak demonstrated, through systematic study of point mutations, that any mutation of a strong consensus sequence at the −3 position or the +4 position resulted in highly impaired translation initiation both ''in vitro'' and ''in vivo''.

Research has shown that a G→C mutation in the −6 position of the β-globin gene (β+45; human) disrupted the haematological and biosynthetic phenotype. This was the first mutation found in the Kozak sequence and showed a 30% decrease in translational efficiency. It was found in a family from Southeast Italy whose members suffered from thalassaemia intermedia.
Similar observations were made regarding mutations at position −5 from the start codon, AUG. Cytosine in this position, as opposed to thymine, showed more efficient translation and increased expression of the platelet adhesion receptor, glycoprotein Ibα, in humans.
Mutations to the Kozak sequence can also have drastic effects upon human health; in particular, certain forms of congenital heart disease are caused by Kozak sequence mutations in the ''GATA4'' gene's 5' untranslated region. The ''GATA4'' gene is responsible for gene expression in a wide variety of tissues including the heart. When the guanosine at the −6 position in the Kozak sequence of ''GATA4'' is mutated to a cytosine, a reduction in GATA4 protein levels results, which leads to a decrease in the expression of genes regulated by the GATA4 transcription factor and linked to the development of atrial septal defect.
The ability of the Kozak sequence to optimize translation can result in novel initiation codons in the typically untranslated 5′ region (5′ UTR) of the mRNA transcript. A G to A mutation was described by Bohlen et al. (2017) in a Kozak-like region in the ''SOX9'' gene that created a new translation initiation codon in an out-of-frame open reading frame. The correct initiation codon was located in a region that did not match the Kozak consensus sequence as closely as the surrounding sequence of the new, upstream initiation site did, which resulted in reduced translation efficiency of functional SOX9 protein. The patient in whom this mutation was detected had developed acampomelic campomelic dysplasia, a developmental disorder that causes skeletal, reproductive and airway issues due to insufficient ''SOX9'' expression.
Variations in the consensus sequence
The Kozak consensus has been variously described as:
      65432-+234
 (gcc)gccRccAUGG (Kozak 1987)
        AGNNAUGN
         ANNAUGG
         ACCAUGG (Spotts et al., 1997, mentioned in Kozak 2002)
      GACACCAUGG (''H. sapiens HBB, HBD'', ''R. norvegicus Hbb'', etc.)
See also
* mRNA, the nucleic acid messenger that serves as the middleman in the Central Dogma of Biology
* Ribosome, the molecular machine responsible for protein synthesis
* Shine–Dalgarno sequence, the ribosomal binding site of prokaryotes
* Translation, the process of peptide synthesis