In
molecular biology
Molecular biology is the branch of biology that seeks to understand the molecular basis of biological activity in and between cells, including biomolecular synthesis, modification, mechanisms, and interactions. The study of chemical and phys ...
and
bioinformatics
Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combin ...
, the consensus sequence (or canonical sequence) is the calculated order of most frequent residues, either
nucleotide
Nucleotides are organic molecules consisting of a nucleoside and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both of which are essential biomolecul ...
or
amino acid
Amino acids are organic compounds that contain both amino and carboxylic acid functional groups. Although hundreds of amino acids exist in nature, by far the most important are the alpha-amino acids, which comprise proteins. Only 22 alpha ...
, found at each position in a
sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Ali ...
. It serves as a simplified representation of the population.
It represents the results of multiple
sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Ali ...
s in which related sequences are compared to each other and similar
sequence motifs are calculated. Such information is important when considering sequence-dependent enzymes such as
RNA polymerase
In molecular biology, RNA polymerase (abbreviated RNAP or RNApol), or more specifically DNA-directed/dependent RNA polymerase (DdRP), is an enzyme that synthesizes RNA from a DNA template.
Using the enzyme helicase, RNAP locally opens the ...
.
[Pierce, Benjamin A. 2002. Genetics : A Conceptual Approach. 1st ed. New York: W.H. Freeman and Co.]
Biological significance
A protein binding site, represented by a consensus sequence, may be a short sequence of
nucleotide
Nucleotides are organic molecules consisting of a nucleoside and a phosphate. They serve as monomeric units of the nucleic acid polymers – deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), both of which are essential biomolecul ...
s which is found several times in the
genome
In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA (or RNA in RNA viruses). The nuclear genome includes protein-coding genes and non-coding ...
and is thought to play the same role in its different locations. For example, many
transcription factors
In molecular biology, a transcription factor (TF) (or sequence-specific DNA-binding factor) is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence. The f ...
recognize particular patterns in the
promoters of the
gene
In biology, the word gene (from , ; "... Wilhelm Johannsen coined the word gene to describe the Mendelian units of heredity..." meaning ''generation'' or ''birth'' or ''gender'') can have several different meanings. The Mendelian gene is a b ...
s they regulate. In the same way,
restriction enzymes
A restriction enzyme, restriction endonuclease, REase, ENase or'' restrictase '' is an enzyme that cleaves DNA into fragments at or near specific recognition sites within molecules known as restriction sites. Restriction enzymes are one class o ...
usually have
palindromic
A palindrome is a word, number, phrase, or other sequence of symbols that reads the same backwards as forwards, such as the words ''madam'' or ''racecar'', the date and time ''11/11/11 11:11,'' and the sentence: "A man, a plan, a canal – Panam ...
consensus sequences, usually corresponding to the site where they cut the DNA.
Transposons
A transposable element (TE, transposon, or jumping gene) is a nucleic acid sequence in DNA that can change its position within a genome, sometimes creating or reversing mutations and altering the cell's genetic identity and genome size. Trans ...
act in much the same manner in their identification of target sequences for transposition. Finally,
splice site
RNA splicing is a process in molecular biology where a newly-made precursor messenger RNA (pre-mRNA) transcript is transformed into a mature messenger RNA (mRNA). It works by removing all the introns (non-coding regions of RNA) and ''splicing'' b ...
s (sequences immediately surrounding the
exon
An exon is any part of a gene that will form a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing. The term ''exon'' refers to both the DNA sequence within a gene and to the corresponding sequenc ...
-
intron
An intron is any nucleotide sequence within a gene that is not expressed or operative in the final RNA product. The word ''intron'' is derived from the term ''intragenic region'', i.e. a region inside a gene."The notion of the cistron .e., gene ...
boundaries) can also be considered as consensus sequences.
Thus a consensus sequence is a model for a putative
DNA binding site
DNA binding sites are a type of binding site found in DNA where other molecules may bind. DNA binding sites are distinct from other binding sites in that (1) they are part of a DNA sequence (e.g. a genome) and (2) they are bound by DNA-binding ...
: it is obtained by aligning all known examples of a certain recognition site and defined as the idealized sequence that represents the predominant base at each position. All the actual examples shouldn't differ from the consensus by more than a few substitutions, but counting mismatches in this way can lead to inconsistencies.
Any mutation allowing a mutated nucleotide in the core promoter sequence to look more like the consensus sequence is known as an up mutation. This kind of mutation will generally make the promoter stronger, and thus the RNA polymerase forms a tighter bind to the DNA it wishes to transcribe and transcription is up-regulated. On the contrary, mutations that destroy conserved nucleotides in the consensus sequence are known as down mutations. These types of mutations down-regulate transcription since RNA polymerase can no longer bind as tightly to the core promoter sequence.
Sequence analysis
Developing software for
pattern recognition
Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphic ...
is a major topic in
genetics
Genetics is the study of genes, genetic variation, and heredity in organisms.Hartl D, Jones E (2005) It is an important branch in biology because heredity is vital to organisms' evolution. Gregor Mendel, a Moravian Augustinian friar worki ...
,
molecular biology
Molecular biology is the branch of biology that seeks to understand the molecular basis of biological activity in and between cells, including biomolecular synthesis, modification, mechanisms, and interactions. The study of chemical and phys ...
, and
bioinformatics
Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combin ...
. Specific
sequence motif
In biology, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function of the macromolecule. For example, an ''N''-glycosylation site motif can be defined as '' ...
s can function as
regulatory sequence
A regulatory sequence is a segment of a nucleic acid molecule which is capable of increasing or decreasing the expression of specific genes within an organism. Regulation of gene expression is an essential feature of all living organisms and vi ...
s controlling biosynthesis, or as
signal sequences that direct a molecule to a specific site within the cell or regulate its maturation. Since the regulatory function of these sequences is important, they are thought to be conserved across long periods of
evolution
Evolution is change in the heritable characteristics of biological populations over successive generations. These characteristics are the expressions of genes, which are passed on from parent to offspring during reproduction. Variation ...
. In some cases, evolutionary relatedness can be estimated by the amount of conservation of these sites.
Notation
The conserved sequence motifs are called consensus sequences and they show which residues are conserved and which residues are variable. Consider the following example
DNA sequence:
:A
TYR
In this
notation
In linguistics and semiotics, a notation is a system of graphics or symbols, characters and abbreviated expressions, used (for example) in artistic and scientific disciplines to represent technical facts and quantities by convention. Therefore, ...
, A means that an A is always found in that position;
Tstands for either C or T; N stands for any base; and means any base except A. Y represents any
pyrimidine
Pyrimidine (; ) is an aromatic, heterocyclic, organic compound similar to pyridine (). One of the three diazines (six-membered heterocyclics with two nitrogen atoms in the ring), it has nitrogen atoms at positions 1 and 3 in the ring. The othe ...
, and R indicates any
purine
Purine is a heterocyclic aromatic organic compound that consists of two rings ( pyrimidine and imidazole) fused together. It is water-soluble. Purine also gives its name to the wider class of molecules, purines, which include substituted purin ...
.
In this example, the notation
Tdoes not give any indication of the relative frequency of C or T occurring at that position. An alternative method of representing a consensus sequence uses a
sequence logo
In bioinformatics, a sequence logo is a graphical representation of the sequence conservation of nucleotides (in a strand of DNA/RNA) or amino acids (in protein sequences).
A sequence logo is created from a collection of aligned sequences and dep ...
. This is a graphical representation of the consensus sequence, in which the size of a symbol is related to the frequency that a given nucleotide (or amino acid) occurs at a certain position. In sequence logos the more conserved the residue, the larger the symbol for that residue is drawn; the less frequent, the smaller the symbol. Sequence logos can be generated usin
WebLogo or using th
Gestalt Workbench a publicly available visualization tool written by Gustavo Glusman at th
Institute for Systems Biology
Software
Bioinformatics tools are able to calculate and visualize consensus sequences. Examples of the tools are
JalView and
UGENE
UGENE is computer software for bioinformatics. It works on personal computer operating systems such as Windows, macOS, or Linux. It is released as free and open-source software, under a GNU General Public License (GPL) version 2.
UGENE helps biolo ...
.
See also
*
Position-specific scoring matrix
A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences.
PWMs are often derived from a set ...
*
Regular expression
A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" ...
— denoting multiple sequences of symbols in
formal language
In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules.
The alphabet of a formal language consists of s ...
theory
*
Sequence motif
In biology, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and usually assumed to be related to biological function of the macromolecule. For example, an ''N''-glycosylation site motif can be defined as '' ...
*
Sequence logo
In bioinformatics, a sequence logo is a graphical representation of the sequence conservation of nucleotides (in a strand of DNA/RNA) or amino acids (in protein sequences).
A sequence logo is created from a collection of aligned sequences and dep ...
References
{{Reflist
Bioinformatics
DNA