In molecular biology and bioinformatics, the consensus sequence (or canonical sequence) is the calculated sequence of the most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. It represents the results of multiple sequence alignments in which related sequences are compared to each other and similar sequence motifs are calculated. Such information is important when considering sequence-dependent enzymes such as RNA polymerase.
[Pierce, Benjamin A. 2002. Genetics: A Conceptual Approach. 1st ed. New York: W.H. Freeman and Co.]
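The definition above (the most frequent residue at each alignment position) can be sketched in a few lines of Python. This is a minimal illustration rather than the method of any particular tool: the function name `consensus`, the example sites, and the alphabetical tie-breaking rule are all assumptions made for the sketch.

```python
from collections import Counter

def consensus(aligned):
    """Return the most frequent residue in each column of an alignment."""
    if not aligned:
        raise ValueError("alignment is empty")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        top = max(counts.values())
        # Ties are broken alphabetically: an arbitrary choice made for this sketch.
        result.append(min(r for r, n in counts.items() if n == top))
    return "".join(result)

# Made-up aligned sites, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
print(consensus(sites))  # TATAAT
```

Real alignments also contain gaps and ambiguous residues, which this sketch treats as ordinary characters.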
Biological significance
A protein binding site, represented by a consensus sequence, may be a short sequence of nucleotides which is found several times in the genome and is thought to play the same role in its different locations. For example, many transcription factors recognize particular patterns in the promoters of the genes they regulate. In the same way, restriction enzymes usually have palindromic consensus sequences, usually corresponding to the site where they cut the DNA. Transposons act in much the same manner in their identification of target sequences for transposition. Finally, splice sites (sequences immediately surrounding the exon-intron boundaries) can also be considered as consensus sequences.
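For DNA, "palindromic" means that a site reads the same as its own reverse complement. A small sketch of that check follows; the helper names are hypothetical, and the EcoRI recognition site GAATTC is used only as a familiar example.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_palindromic(site):
    """True if the site equals its own reverse complement."""
    s = site.upper()
    return s == reverse_complement(s)

print(is_palindromic("GAATTC"))  # True (EcoRI recognition site)
print(is_palindromic("GATTAC"))  # False
```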
Thus a consensus sequence is a model for a putative DNA binding site: it is obtained by aligning all known examples of a certain recognition site and defined as the idealized sequence that represents the predominant base at each position. Actual examples should not differ from the consensus by more than a few substitutions, but counting mismatches in this way can lead to inconsistencies.
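Counting mismatches against a consensus amounts to a position-by-position comparison. The sketch below uses a hypothetical helper `mismatches` and made-up sites. Note that a plain count weights every position equally, whether the column is strongly or weakly conserved, which may be one source of the inconsistencies mentioned above.

```python
def mismatches(site, consensus_seq):
    """Count positions at which a site differs from the consensus."""
    if len(site) != len(consensus_seq):
        raise ValueError("site and consensus must have equal length")
    return sum(a != b for a, b in zip(site, consensus_seq))

# Hypothetical consensus and example sites, for illustration only.
consensus_seq = "TATAAT"
for site in ["TATAAT", "TATGAT", "GACAAT"]:
    print(site, mismatches(site, consensus_seq))
```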
Any mutation that makes a nucleotide in the core promoter sequence look more like the consensus sequence is known as an up mutation. This kind of mutation generally makes the promoter stronger: RNA polymerase binds more tightly to the DNA to be transcribed, and transcription is up-regulated. Conversely, mutations that destroy conserved nucleotides in the consensus sequence are known as down mutations. These mutations down-regulate transcription, since RNA polymerase can no longer bind as tightly to the core promoter sequence.
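The up/down distinction can be illustrated by comparing a single substitution against the consensus. The sketch below is deliberately simplified: the function `classify_mutation`, the 0-based `position` argument, and the example consensus are assumptions, and real promoter strength depends on more than identity with the consensus.

```python
def classify_mutation(site, consensus_seq, position, new_base):
    """Classify a point substitution relative to the consensus.

    position is 0-based. Returns "up" if the change creates the consensus
    base, "down" if it destroys one, and "neutral" otherwise.
    """
    old_base = site[position]
    target = consensus_seq[position]
    if new_base == old_base:
        return "no change"
    if new_base == target:
        return "up"
    if old_base == target:
        return "down"
    return "neutral"

# Hypothetical promoter element and consensus, for illustration only.
print(classify_mutation("TATGAT", "TATAAT", 3, "A"))  # up
print(classify_mutation("TATAAT", "TATAAT", 1, "C"))  # down
```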
Sequence analysis
Developing software for pattern recognition is a major topic in genetics, molecular biology, and bioinformatics. Specific sequence motifs can function as regulatory sequences controlling biosynthesis, or as signal sequences that direct a molecule to a specific site within the cell or regulate its maturation. Since the regulatory function of these sequences is important, they are thought to be conserved across long periods of evolution. In some cases, evolutionary relatedness can be estimated by the amount of conservation of these sites.
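One crude way to quantify the conservation of a site between two organisms is the fraction of identical positions in a pairwise alignment. The sketch below assumes two already-aligned, equal-length sequences and a hypothetical helper name. It is a rough proxy only, not a model of evolutionary distance.

```python
def fraction_conserved(a, b):
    """Fraction of aligned positions at which two sequences are identical."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    if not a:
        raise ValueError("sequences are empty")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Made-up versions of the same site in two species, for illustration only.
print(fraction_conserved("TATAAT", "TATGAT"))
```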
Notation
The conserved sequence motifs are called consensus sequences and they show which residues are conserved and which residues are variable. Consider the following example DNA sequence:
:A[CT]N{A}YR
In this notation, A means that an A is always found in that position; [CT] stands for either C or T; N stands for any base; and {A} means any base except A. Y represents any pyrimidine, and R indicates any purine.
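This notation maps closely onto regular expressions: a bracketed set such as [CT] is already a character class, a braced exclusion such as {A} becomes a negated class, and N, Y and R expand to the sets of bases they stand for. The converter below is a minimal sketch under those assumptions. The function name `to_regex` is hypothetical, and only the symbols described above are handled.

```python
import re

# Expansions for the ambiguity codes described in the text.
AMBIGUITY = {"N": "[ACGT]", "Y": "[CT]", "R": "[AG]"}

def to_regex(pattern):
    """Translate consensus notation such as A[CT]N{A}YR into a regex string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            # A bracketed set is already a valid regex character class.
            end = pattern.index("]", i)
            out.append(pattern[i:end + 1])
            i = end + 1
        elif ch == "{":
            # A braced set means "any base except these".
            end = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:end] + "]")
            i = end + 1
        else:
            out.append(AMBIGUITY.get(ch, ch))
            i += 1
    return "".join(out)

regex = to_regex("A[CT]N{A}YR")
print(regex)                                # A[CT][ACGT][^A][CT][AG]
print(bool(re.fullmatch(regex, "ACGCTA")))  # True
print(bool(re.fullmatch(regex, "ATAACG")))  # False: fourth base is A
```

The negated class `[^A]` matches any character other than A, so this sketch assumes the input contains only the four DNA bases.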
In this example, the notation [CT] does not give any indication of the relative frequency of C or T occurring at that position, nor can the pattern be written as a single consensus sequence such as ACNCCA. An alternative method of representing a consensus sequence uses a sequence logo. This is a graphical representation of the consensus sequence, in which the size of a symbol is related to the frequency with which a given nucleotide (or amino acid) occurs at a certain position. In sequence logos, the more conserved the residue, the larger the symbol for that residue is drawn; the less frequent, the smaller the symbol. Sequence logos can be generated using WebLogo, or using the Gestalt Workbench, a publicly available visualization tool written by Gustavo Glusman at the Institute for Systems Biology.
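A common way to scale the symbols in a sequence logo is by information content: each column is given a total height of log2(alphabet size) minus the column's Shannon entropy, in bits, and each residue takes a share of that height proportional to its frequency. The sketch below follows that scheme for ungapped DNA alignments. It omits any small-sample correction, and it is not claimed to reproduce what WebLogo or the Gestalt Workbench compute; the function name `logo_heights` and the example sites are assumptions.

```python
import math
from collections import Counter

def logo_heights(aligned, alphabet="ACGT"):
    """Per-column symbol heights (in bits) for a sequence logo.

    Assumes an ungapped alignment containing only symbols from alphabet.
    """
    max_bits = math.log2(len(alphabet))
    heights = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column)
        freqs = {r: counts[r] / total for r in alphabet if counts[r]}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        info = max_bits - entropy  # total height of this column
        heights.append({r: f * info for r, f in freqs.items()})
    return heights

# Made-up aligned sites, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
for position, column in enumerate(logo_heights(sites), start=1):
    print(position, {r: round(h, 3) for r, h in column.items()})
```

A fully conserved column reaches the maximum of 2 bits for DNA, while a column with all four bases at equal frequency has height zero.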
Software
Bioinformatics tools can calculate and visualize consensus sequences. Examples include JalView and UGENE.
See also
* Position-specific scoring matrix
* Regular expression: denoting multiple sequences of symbols in formal language theory
* Sequence motif
* Sequence logo