
In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate.
Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.
Annotation is performed after a genome is sequenced and assembled, and is a necessary step in genome analysis before the sequence is deposited in a database and described in a published article. Although describing individual genes and their products or functions is sufficient for a description to count as an annotation, the depth of analysis reported in the literature for different genomes varies widely, with some reports including additional information that goes beyond a simple annotation.
Furthermore, due to the size and complexity of sequenced genomes, DNA annotation is not performed manually, but is instead automated by computational means. However, the conclusions drawn from the obtained results require manual expert analysis.
DNA annotation is classified into two categories: ''structural annotation'', which identifies and demarcates elements in a genome, and ''functional annotation'', which assigns functions to these elements.
This is not the only way in which it has been categorized, as several alternatives, such as dimension-based and level-based classifications, have also been proposed.
History
The first generation of genome annotators used local ''ab initio'' methods, which are based solely on the information that can be extracted from the DNA sequence on a local scale, that is, one open reading frame (ORF) at a time.
They arose from the need to handle the enormous amount of data produced by the Maxam-Gilbert and Sanger DNA sequencing techniques developed in the late 1970s. The first software used to analyze sequencing reads was the Staden Package, created by Rodger Staden in 1977.
It performed several tasks related to annotation, such as base and codon counts. In fact, codon usage was the main strategy used by several early protein coding sequence (CDS) prediction methods, based on the assumption that the most translated regions in a genome contain codons with the most abundant corresponding tRNAs (the molecules responsible for carrying amino acids to the ribosome during protein synthesis), allowing a more efficient translation. This was also known to be the case for synonymous codons, which are often present in proteins expressed at a lower level.
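The codon counts mentioned above can be illustrated with a minimal Python sketch. It is not the Staden Package's implementation; the function names `count_codons` and `codon_frequencies` and the example sequence are hypothetical, and the input is assumed to be an in-frame coding sequence.

```python
from collections import Counter


def count_codons(cds: str) -> Counter:
    """Count codons in a coding sequence that is assumed to start in frame."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies(cds: str) -> dict:
    """Relative codon usage: each codon's share of all codons in the sequence."""
    counts = count_codons(cds)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Hypothetical toy sequence: ATG GCT GCT GCA AAA TAA
example = "ATGGCTGCTGCAAAATAA"
counts = count_codons(example)
freqs = codon_frequencies(example)
```

An early codon-usage predictor would compare frequencies like these against a reference table built from known genes; that comparison step is not shown here.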
The advent of complete genomes in the 1990s (the first one being the genome of ''Haemophilus influenzae'', sequenced in 1995) introduced a second generation of annotators. Just like in the previous generation, they performed annotation through ''ab initio'' methods, but now applied on a genome-wide scale.
Markov models are the driving force behind many algorithms used within annotators of this generation; these models can be thought of as directed graphs where nodes represent different genomic signals (such as transcription and translation start sites) connected by arrows representing the scanning of the sequence. To ensure a Markov model detects a genomic signal, it must first be trained on a series of known genomic signals. The output of Markov models in the context of annotation includes the probabilities of every kind of genomic element in every single part of the genome, and an accurate Markov model will assign high probabilities to correct annotations and low probabilities to the incorrect ones.
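The train-then-score idea described above can be sketched with a first-order Markov chain over nucleotides. This is a deliberately simplified illustration, not the model used by any particular annotator: the training sequences are invented toy data, the function names are hypothetical, and inputs are assumed to contain only the uppercase letters A, C, G, and T.

```python
import math

BASES = "ACGT"


def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | current base)."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    model = {}
    for a in BASES:
        row_total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / row_total for b in BASES}
    return model


def log_likelihood(model, seq):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(model[a][b]) for a, b in zip(seq, seq[1:]))


def log_odds(signal_model, background_model, seq):
    """Positive values mean the sequence looks more like the signal class."""
    return log_likelihood(signal_model, seq) - log_likelihood(background_model, seq)


# Invented toy training sets standing in for "known signals" and "background".
signal = train_markov(["GCGCGCGC", "CGCGCGCG"])
background = train_markov(["ATATATAT", "TATATATA"])
score_signal_like = log_odds(signal, background, "GCGCGC")
score_background_like = log_odds(signal, background, "ATATAT")
```

The pseudocount keeps unseen transitions from receiving zero probability, which would otherwise make the logarithm undefined.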

As more sequenced genomes became available in the early and mid 2000s, coupled with the numerous protein sequences that were obtained experimentally, genome annotators began employing homology-based methods, launching the third generation of genome annotation. These new methods allowed annotators not only to infer genomic elements through statistical means (as in previous generations) but also to perform their task by comparing the sequence being annotated with other already existing and validated sequences. These so-called combiner annotators, which perform both ''ab initio'' and homology-based annotation, require fast alignment algorithms to identify regions of homology.
In the late 2000s, genome annotation shifted its attention towards identifying non-coding regions in DNA, which was achieved thanks to the appearance of methods to analyze transcription factor binding sites, DNA methylation sites, chromatin structure, and other RNA and regulatory region analysis techniques. Other genome annotators also began to focus on population-level studies represented by the pangenome; by doing so, for instance, annotation pipelines ensure that core genes of a clade are also found in new genomes of the same clade. Both annotation strategies constitute the fourth generation of genome annotators.
By the 2010s, the genome sequences of more than a thousand human individuals (through the 1000 Genomes Project) and several model organisms became available. As such, genome annotation remains a major challenge for scientists investigating the human and other genomes.
Structural annotation

Structural annotation describes the precise location of the different elements in a genome, such as open reading frames (ORFs), coding sequences (CDS), exons, introns, repeats, splice sites, regulatory motifs, start and stop codons, and promoters.
The main steps of structural annotation are:
# Repeat identification and masking.
# Evidence alignment (optional).
# Splice identification (only in eukaryotes).
# Feature prediction (coding and noncoding sequences).
Repeat identification and masking
The first step of structural annotation consists of the identification and masking of repeats, which include low-complexity sequences (such as AGAGAGAG, or monopolymeric segments like TTTTTTTTT), and transposons (which are larger elements with several copies across the genome). Repeats are a major component of both prokaryotic and eukaryotic genomes; for instance, between 0% and over 42% of prokaryotic genomes consist of repeats and three quarters of the human genome is composed of repetitive elements.
Identifying repeats is difficult for two main reasons: they are poorly conserved, and their boundaries are not clearly defined. Because of this, repeat libraries must be built for the genome of interest, which can be accomplished with one of the following methods:
* ''De novo'' methods. Repeats are identified by detecting and grouping pairs of sequences at different locations whose similarity is above a minimum threshold of sequence conservation in a self-genome comparison, thus requiring no prior information about repeat structure or sequences. The disadvantage of these methods is that they can identify any repeated sequence, not just transposons, and may include conserved coding sequences (CDS), making careful post-processing an indispensable step to remove these sequences. They may also leave out related regions that have degraded over time and may group elements that have no connection in their evolutionary history.
* Homology-based methods. Repeats are identified by similarity (homology) to known repeats stored in a curated database. These methods are more likely to find real transposons, even in lower quantities, when compared with ''de novo'' methods, but are biased towards previously identified families.
* Structure-based methods. Repeats are identified based on models of their structure, rather than repetition or similarity. They are capable of identifying real transposons (just like the homology-based ones), but are not biased by known elements. However, they are highly specific to each class of repeat, and, as such, are less universally applicable.
* Comparative genomic methods. Repeats are identified as disruptions of one or more sequences in a multiple sequence alignment produced by large insertion regions. Although this strategy avoids the poorly defined boundary problem that exists in other methods, it is highly dependent on assembly quality and the level of activity of transposons in the genomes in question.
After the repetitive regions in a genome have been identified, they are masked. ''Masking'' means replacing the letters of the nucleotides (A, C, G, or T) with other letters. By doing so, these regions will be marked as repetitive and downstream analyses will treat them accordingly. Repetitive regions may produce performance issues if they are not masked, and may even produce false evidence for gene annotation (for example, treating an open reading frame (ORF) in a transposon as an exon).
Depending on the letters used for replacement, masking can be classified as soft or hard: in ''soft masking'', repetitive regions are indicated with lowercase letters (a, c, g, or t), whereas in ''hard masking'', the letters of these regions are replaced with N's. This way, for example, soft masking can be used to exclude word matches and avoid initiating an alignment in those regions, and hard masking, apart from all of this, can also exclude masked regions from alignment scores.
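The difference between the two masking styles can be shown with a short Python sketch. The functions `soft_mask` and `hard_mask` are hypothetical helpers, not part of any masking tool, and the repeat coordinates are assumed to be 0-based, half-open `(start, end)` intervals supplied by an earlier repeat-identification step.

```python
def soft_mask(seq: str, intervals) -> str:
    """Lowercase the bases inside each repeat interval; leave the rest uppercase."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


def hard_mask(seq: str, intervals) -> str:
    """Replace the bases inside each repeat interval with N."""
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


# Toy sequence with a low-complexity AGAGAGAG run at positions 4..11.
sequence = "ACGTAGAGAGAGTTCA"
repeats = [(4, 12)]
soft = soft_mask(sequence, repeats)  # "ACGTagagagagTTCA"
hard = hard_mask(sequence, repeats)  # "ACGTNNNNNNNNTTCA"
```

Soft masking is reversible (calling `upper()` restores the sequence), whereas hard masking discards the original bases.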
Evidence alignment
The next step after genome masking usually involves aligning all available transcript and protein evidence with the analyzed genome, that is, aligning all known expressed sequence tags (ESTs), RNAs and proteins of the organism being annotated with the genome.
Although it is optional, it can improve gene sequence elucidation because RNAs and proteins are direct products of coding sequences.
If RNA-Seq data is available, it may be used to annotate and quantify all of the genes and their isoforms located in the corresponding genome, providing not only their locations, but also their rates of expression.
However, transcripts provide insufficient information for gene prediction because they might be unobtainable from some genes, they may encode operons of more than one gene, and their start and stop codons cannot be determined due to frameshifts and translation initiation factors.
To solve this problem, proteogenomics-based approaches are employed, which utilize information from expressed proteins, often derived from mass spectrometry.
Splice identification
Annotation of eukaryotic genomes has an extra layer of difficulty due to RNA splicing, a post-transcriptional process in which introns (non-coding regions) are removed and exons (coding regions) are joined.
Therefore, eukaryotic coding sequences (CDS) are discontinuous, and, to ensure their proper identification, intronic regions must be filtered out. To do so, annotation pipelines must find the exon-intron boundaries, and multiple methodologies have been developed for this purpose. One solution is to use known exon boundaries for alignment; for instance, many introns begin with GT and end with AG. This approach, however, cannot detect novel boundaries, so alternatives exist, such as machine learning algorithms that are trained on known exon boundaries and quality information to predict new ones. Predictors of new exon boundaries usually require efficient data-compression and alignment algorithms, but they are prone to failure at boundaries located in regions with low sequence coverage or high error rates produced during sequencing.
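The GT...AG rule mentioned above can be illustrated with a naive Python sketch that lists every span starting with GT and ending with AG. It is only an illustration of the boundary rule, not a usable splice-site predictor: the function name `candidate_introns` and the `min_len` parameter are hypothetical, and real pipelines narrow such candidates with alignment evidence or trained models.

```python
def candidate_introns(seq: str, min_len: int = 4):
    """Return 0-based, half-open (start, end) spans that begin with GT and end with AG."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    spans = []
    for d in donors:
        for a in acceptors:
            end = a + 2  # the span includes the AG dinucleotide
            if end - d >= min_len:
                spans.append((d, end))
    return spans


# Toy sequence: exon "CC", intron "GTAAAAAG", exon "CC".
toy = "CCGTAAAAAGCC"
spans = candidate_introns(toy)
spliced = [toy[:s] + toy[e:] for s, e in spans]
```

Because GT and AG dinucleotides are very common, this enumeration over-generates heavily on real sequences, which is one reason additional evidence is needed.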
Feature prediction
A genome is divided into coding and noncoding regions, and the last step of structural annotation consists of identifying these features within the genome. In fact, the primary task in genome annotation is gene prediction, which is why numerous methods have been developed for this purpose.
Gene prediction is a misleading term, as most gene predictors only identify coding sequences (CDS) and do not report untranslated regions (UTRs); for this reason, CDS prediction has been proposed as a more accurate term.
CDS predictors detect genome features through methods called ''sensors'', which include ''signal sensors'' that identify functional site signals such as promoters and polyA sites, and ''content sensors'' that classify DNA sequences into coding and noncoding content.
Whereas prokaryotic CDS predictors mostly deal with open reading frames (ORFs), which are segments of DNA between the start and stop codons, eukaryotic CDS predictors are faced with a more difficult problem because of the complex organization of eukaryotic genes.
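The ORF definition above (a segment between a start codon and a stop codon) can be sketched as a simple scan of the three forward reading frames. This is a minimal illustration under stated assumptions, not how any specific predictor works: `find_orfs` and `min_codons` are hypothetical names, only ATG is treated as a start codon, the standard stop codons TAA, TAG, and TGA are assumed, and the reverse strand is not scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2):
    """Return (start, end, frame) for forward-strand ORFs; end is exclusive and includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence: ATG AAA TTT TAG in frame 0.
toy_gene = "ATGAAATTTTAG"
orfs = find_orfs(toy_gene)
```

A full six-frame scan would repeat the same loop on the reverse complement of the sequence.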
CDS prediction methods can be classified into three broad categories:
* ''Ab initio'' methods (also called statistical, intrinsic, or ''de novo''). CDS prediction is based solely on the information that can be extracted from the DNA sequence. They rely on statistical methods such as the hidden Markov model (HMM). Some methods employ two or more genomes to infer local mutation rates and patterns along the genome.
* Homology-based methods (also called empirical, evidence-driven, or extrinsic). CDS prediction is based on similarity to known sequences. Specifically, the analyzed sequence is aligned with expressed sequence tags (ESTs), complementary DNA (cDNA), or protein sequences.
* Combiners. CDS prediction is done by a combination of both methods mentioned above.
Functional annotation
Functional annotation assigns functions to the genomic elements found by structural annotation, by relating them to biological processes such as the cell cycle, cell death, development, metabolism, etc.
It may also be used as an additional quality check by identifying elements that may have been annotated in error.
Coding sequence function prediction

Functional annotation of genes requires a controlled vocabulary (or ontology) to name the predicted functional features. However, because there are numerous ways to define gene functions, the annotation process may be hindered when it is performed by different research groups. As such, a standardized controlled vocabulary must be employed, the most comprehensive of which is the Gene Ontology (GO). It classifies functional properties into one of three categories (molecular function, biological process, and cellular component) and organizes them in a directed acyclic graph, in which every node is a particular function, and every edge (or arrow) between two nodes indicates a parent-child or subcategory-category relationship.
As of 2020, GO is the most widely used controlled vocabulary for functional annotation of genes, followed by the MIPS Functional Catalog (FunCat).
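The directed acyclic graph structure described above can be sketched with a small Python example. The term names and parent links below are an invented toy hierarchy used only for illustration, not actual GO identifiers or relations, and `ancestors` is a hypothetical helper. The sketch shows why annotating a gene with a specific term implicitly annotates it with every more general term reachable through parent edges.

```python
def ancestors(term: str, parents: dict) -> set:
    """Collect every term reachable from `term` by following child-to-parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Invented toy ontology: each key maps a term to its direct parents.
# A term may have more than one parent, which is what makes this a DAG
# rather than a tree.
toy_parents = {
    "dna_replication": ["dna_metabolism", "cell_cycle_process"],
    "dna_metabolism": ["metabolism"],
    "cell_cycle_process": ["biological_process"],
    "metabolism": ["biological_process"],
}

implied_terms = ancestors("dna_replication", toy_parents)
```

The `seen` set ensures that a term reachable through several paths (here `biological_process`) is visited only once.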
Some conventional methods for functional annotation are homology-based, which rely on local alignment search tools.
Their premise is that high sequence conservation between two genomic elements implies that their function is conserved as well. Pairs of homologous sequences that appeared through paralogy, orthology, or xenology usually perform a similar function. However, orthologous sequences should be treated with caution for two reasons: (1) they might have different names depending on when they were originally annotated, and (2) they may not perform the same functional role in two different organisms. Annotators often refer to an analogous sequence when no paralogy, orthology or xenology is found.
Homology-based methods have several drawbacks, such as errors in the database, low sensitivity/specificity, inability to distinguish between orthology and paralogy, artificially high scores due to the presence of low complexity regions, and significant variation within a protein family.
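As a rough illustration of what a local alignment search computes, the sketch below scores the best local alignment between two sequences with the Smith-Waterman recurrence. The scoring values (match, mismatch, linear gap penalty) are arbitrary assumptions; real search tools use substitution matrices, heuristics and statistical significance estimates that are not shown here.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    cols = len(b) + 1
    prev = [0] * cols          # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman("TTACGTTT", "GGACGTGG"))  # shared "ACGT" core -> 8
```

A high score only indicates sequence similarity; whether the function is actually conserved is subject to the caveats listed above.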
Functional annotation can be performed through probabilistic methods. The distribution of hydrophilic and hydrophobic amino acids indicates whether a protein is located in a solution or membrane. Specific sequence motifs provide information on posttranslational modifications and final location of any given protein.
Probabilistic methods may be paired with a controlled vocabulary, such as GO; for example, protein-protein interaction (PPI) networks usually place proteins with similar functions close to each other.
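The idea that the distribution of hydrophobic residues hints at membrane localization can be sketched with a sliding-window hydropathy profile. The example uses the Kyte-Doolittle hydropathy scale; the window of 19 residues and the cutoff of 1.6 are commonly quoted heuristics and should be treated as assumptions here, not as a validated predictor.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every window of the given length along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

def looks_membrane_spanning(seq, window=19, threshold=1.6):
    """True if any window is hydrophobic enough to suggest a membrane segment."""
    return any(value > threshold for value in hydropathy_profile(seq, window))

print(looks_membrane_spanning("L" * 25))  # strongly hydrophobic stretch
print(looks_membrane_spanning("D" * 25))  # strongly hydrophilic stretch
```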
Machine learning methods are also used to generate functional annotations for novel proteins based on GO terms. Generally, they consist in constructing a binary classifier for each GO term, which are then joined to make predictions on individual GO terms (forming a multiclass classifier) for which confidence scores are later obtained. The support vector machine (SVM) is the most widely used binary classifier in functional annotation; however, other algorithms, such as k-nearest neighbors (kNN) and convolutional neural network (CNN), have also been employed.
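The per-term scheme can be sketched with a small k-nearest neighbors example, chosen here instead of an SVM only because it fits in a few lines. The feature vectors and term labels are hypothetical; each term receives its own yes/no vote among the neighbors, and the fraction of positive votes serves as a confidence score.

```python
import math

def knn_term_scores(query, training, k=3):
    """Score each term by the fraction of the k nearest proteins carrying it.

    training: list of (feature_vector, set_of_terms) pairs (hypothetical data).
    """
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    all_terms = set().union(*(terms for _, terms in training))
    # One independent binary decision per term, expressed as a confidence in [0, 1].
    return {
        term: sum(term in terms for _, terms in nearest) / len(nearest)
        for term in all_terms
    }

training = [
    ((0.0, 0.0), {"T1"}),
    ((0.1, 0.0), {"T1", "T2"}),
    ((5.0, 5.0), {"T3"}),
    ((0.0, 0.2), {"T1"}),
]
print(knn_term_scores((0.0, 0.05), training))
```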
Binary or multiclass classification methods for functional annotation generally produce less accurate results because they do not take into account the interrelations between GO terms. More advanced methods that consider these interrelations do so by either a flat or hierarchical approach; the former does not take into account the ontology structure, while the latter does. Some of these methods compress the GO terms by matrix factorization or by hashing, thus boosting their performance.
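One simple way to make per-term predictions respect the ontology structure is to post-process the scores so that no ancestor scores lower than any of its descendants, since an annotation to a term implies annotation to its ancestors. The sketch below is an illustrative assumption of how this could be done, not a specific published method; term names are hypothetical.

```python
def propagate_scores(scores, parents):
    """Raise every ancestor's score to at least the score of its descendants.

    scores:  dict term -> confidence from independent per-term classifiers
    parents: dict term -> list of direct parent terms (a DAG)
    """
    result = dict(scores)

    def lift(term, value):
        for parent in parents.get(term, []):
            if result.get(parent, 0.0) < value:
                result[parent] = value
                lift(parent, value)  # keep climbing until the root is reached

    for term, value in scores.items():
        lift(term, value)
    return result

parents = {"c": ["b"], "b": ["a"], "a": []}
print(propagate_scores({"a": 0.2, "b": 0.1, "c": 0.9}, parents))
```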
Noncoding sequence function prediction
Noncoding sequences (ncDNA) are those that do not code for proteins. They include elements such as pseudogenes, segmental duplications, binding sites and RNA genes.
Pseudogenes are mutated copies of protein-coding genes that lost their coding function due to a disruption in their open reading frame (ORF), making them untranslatable.
They may be identified using one of the following two methods:
* Homology-based method. Pseudogenes are identified by searching for sequences that are similar to functional genes but contain mutations that disrupt their ORF. This method cannot determine the evolutionary relationship between a pseudogene and its parent gene, nor the time elapsed since the event happened.
* Phylogeny-based method. Pseudogenes are identified by means of a phylogenetic analysis. First, a species tree of the species of interest and a phylogenetic tree of the gene (or gene family) of interest are constructed. The two are then compared to identify a species that has lost the gene. Next, within the genome of the species where the gene was not found, a sequence is searched that is orthologous to the gene identified in the closest species. Finally, if this orthologous sequence has a disruption in its ORF (and it meets other criteria, such as RNA-Seq data analysis, dN/dS ratio, etc.), the sequence is considered a pseudogene.
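Both methods above depend on detecting a disrupted ORF. A minimal sketch of such a check is shown below; it assumes the standard genetic code, an ATG start codon and a candidate sequence already extracted in the expected reading frame, all of which are simplifications.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)

def orf_disruptions(cds):
    """List reasons why a candidate coding sequence cannot encode a full protein."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    if not cds.startswith("ATG"):
        problems.append("missing start codon")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon at codon {index + 1}")
            break
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    return problems

print(orf_disruptions("ATGAAATAA"))     # intact toy ORF -> []
print(orf_disruptions("ATGTAAAAATAA"))  # stop codon in the middle
```

An empty list does not prove that a sequence is a functional gene; as noted above, additional evidence such as expression data is normally considered.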
Segmental duplications are DNA segments of more than 1000 base pairs that are repeated in the genome with more than 90% sequence identity. Two strategies used for their identification are WGAC and WSSD:
* Whole-Genome Assembly Comparison (WGAC). It aligns the entire genome to itself in order to identify repeated sequences after filtering out common repeats; it does not require having the original reads used for the assembly.
* Whole-genome Shotgun Sequence Detection (WSSD). It aligns the original reads with the assembled genome and searches for regions with a higher read depth than the average, which usually are signals of duplication. Segmental duplications identified by this method but not by WGAC are likely collapsed duplications, which means that they were mistakenly aligned to the same region.
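The read-depth idea behind WSSD can be sketched as follows. The per-base depth list, the window size and the cutoff of two standard deviations above the mean are all hypothetical choices made for illustration; real pipelines calibrate these thresholds against control regions.

```python
from statistics import mean, pstdev

def high_depth_windows(depths, window=1000, z=2.0):
    """Return (start, end) windows whose mean read depth exceeds mean + z * sd."""
    if not depths:
        return []
    windows = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        windows.append((start, start + len(chunk), mean(chunk)))
    values = [m for _, _, m in windows]
    cutoff = mean(values) + z * pstdev(values)
    return [(s, e) for s, e, m in windows if m > cutoff]

# Toy genome: uniform 30x coverage except for one 1 kb region at 90x.
depths = [30] * 9000 + [90] * 1000
print(high_depth_windows(depths))  # -> [(9000, 10000)]
```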
DNA binding sites are regions in the genome sequence that bind to and interact with specific proteins. They play an important role in DNA replication and repair, transcriptional regulation, and viral infection. Binding site prediction involves the use of one of the following two methods:
* Sequence similarity based methods. They consist in identifying homologous sequences with known DNA binding sites, or in aligning them with query proteins. Their performance is usually low because the DNA binding sequences are less conserved.
* Structure based methods. They employ the three-dimensional structural information of proteins to predict the locations of DNA binding sites.
Noncoding RNA (ncRNA), produced by RNA genes, is a type of RNA that is not translated into a protein. It includes molecules such as tRNA, rRNA, snoRNA, and microRNA, as well as noncoding mRNA-like transcripts. ''Ab initio'' prediction of RNA genes in a single genome often yields inaccurate results (with an exception being miRNA), so multi-genome comparative methods are used instead. These methods are specifically concerned with the secondary structures of ncRNA, as they are conserved in related species even when their sequence is not. Therefore, by performing a multiple sequence alignment, more useful information can be obtained for their prediction. Homology search may also be employed to identify RNA genes, but this procedure is complicated, especially in eukaryotes, due to the presence of a large number of repeats and pseudogenes.
Visualization
File formats
Visualization of annotations in a genome browser requires a descriptive output file, which should describe the intron-exon structures of each annotation, their start and stop codons, UTRs and alternative transcripts, and ideally should include information about the sequence alignments and gene predictions that support each gene model. Some commonly used formats for describing annotations are GenBank, GFF3, GTF, BED and EMBL.
Some of these formats use controlled vocabularies and ontologies to define their descriptive terminologies and guarantee interoperability between analysis and visualization tools.
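To make the role of these formats concrete, the sketch below parses one GFF3 feature line (nine tab-separated columns) and converts it to a six-column BED record. The example line and its identifiers are made up. The conversion reflects the different coordinate conventions: GFF3 uses 1-based inclusive coordinates, whereas BED uses 0-based half-open intervals.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("GFF3 feature lines have nine tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    # Column 9 holds semicolon-separated key=value pairs such as ID and Parent.
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
    )
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes,
    }

def to_bed(feature):
    """Convert a parsed GFF3 feature to a BED6 line (0-based, half-open)."""
    name = feature["attributes"].get("ID", ".")
    return "\t".join([
        feature["seqid"],
        str(feature["start"] - 1),  # shift start to 0-based
        str(feature["end"]),        # end is unchanged in half-open notation
        name, "0", feature["strand"],
    ])

line = "chr1\texample\texon\t101\t200\t.\t+\t.\tID=exon1;Parent=tx1"
print(to_bed(parse_gff3_line(line)))
```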
Genome browsers
Genomic browsers are software products that simplify the analysis and visualization of large genomic sequence and annotation data to gain biological insight, via a graphical interface.
Genomic browsers can be divided into web-based genomic browsers and stand-alone genomic browsers. The former use information from databases and can be classified into ''multiple-species'' (integrate sequence and annotations of multiple organisms and promote cross-species comparative analysis) and ''species-specific'' (focus on one organism and the annotations for particular species). The latter are not necessarily linked to a specific genome database but are general-purpose browsers that can be downloaded and installed as an application on a local computer.
Comparative visualization of genomes
Comparative genomics aims to identify similarities and differences in genomic features, as well as to examine evolutionary relationships between organisms.
Visualization tools capable of illustrating the comparative behavior between two or more genomes are essential for this approach, and can be classified into three categories based on the representation of the relationships between the compared genomes:
* Dot plots: This scheme can only show the alignment of two genomes. One genome is represented along the horizontal axis and the other along the vertical axis, and the dots in the plot represent the genomic elements that are similar between the two annotations.
* Linear representation: This representation uses multiple linear tracks to represent multiple genomes and their features, where a "track" refers to a specific type of genomic feature at a genomic location.
* Circular representation: This representation facilitates comparison of whole microbial or viral genomes. In this visualization mode, concentric circles and arcs are used to represent genomic sections.
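The dot plot scheme can be sketched by computing the coordinates of the dots. In the example below, a dot is placed wherever the two sequences share an identical k-mer; the k-mer length is an arbitrary assumption, and real tools typically plot alignments or annotated elements instead of exact word matches.

```python
def dot_plot(seq_x, seq_y, k=3):
    """Return sorted (x, y) positions where seq_x and seq_y share an identical k-mer."""
    # Index every k-mer of the sequence on the vertical axis.
    index = {}
    for y in range(len(seq_y) - k + 1):
        index.setdefault(seq_y[y:y + k], []).append(y)
    # Look up each k-mer of the horizontal sequence in that index.
    return sorted(
        (x, y)
        for x in range(len(seq_x) - k + 1)
        for y in index.get(seq_x[x:x + k], [])
    )

print(dot_plot("ACGTAC", "ACGTAC"))  # -> [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Runs of dots along a diagonal correspond to conserved, collinear regions, while breaks or off-diagonal runs suggest insertions, deletions or rearrangements.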
Quality control
The quality of the sequence assembly influences the quality of the annotation, so it is important to assess assembly quality before performing the subsequent annotation steps.
In order to quantify the quality of a genome annotation, three metrics have been used: recall, precision and accuracy. However, these measures are not explicitly used in annotation projects, but rather in discussions of prediction accuracy.
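As an illustration, the three metrics can be computed at the nucleotide level by comparing predicted coding positions with a reference annotation. The sketch assumes the standard confusion-matrix definitions; other definitions of accuracy are also used in the gene prediction literature, and the position sets below are made up.

```python
def annotation_metrics(predicted, reference, genome_length):
    """Compare two sets of 0-based positions annotated as coding."""
    tp = len(predicted & reference)         # coding in both
    fp = len(predicted - reference)         # predicted only
    fn = len(reference - predicted)         # missed by the prediction
    tn = genome_length - tp - fp - fn       # noncoding in both
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / genome_length if genome_length else 0.0
    return {"recall": recall, "precision": precision, "accuracy": accuracy}

predicted = set(range(0, 10))
reference = set(range(5, 15))
print(annotation_metrics(predicted, reference, genome_length=20))
```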
Community annotation approaches are useful techniques for quality control and standardization in genome annotation. An annotation jamboree that took place in 2002 led to the creation of the annotation standards used by the Sanger Institute's Human and Vertebrate Analysis Project (HAVANA).
Re-annotation
Annotation projects often rely on previous annotations of an organism's genome; however, these older annotations may contain errors that can propagate to new annotations. As new genome analysis technologies are developed and richer databases become available, the annotation of some older genomes may be updated. This process, known as re-annotation, can provide users with new information about the genome, including details about genes and protein functions. Re-annotation is therefore a useful approach in quality control.
Community annotation
Community annotation consists in the engagement of a community (both scientific and nonscientific) in genome annotation projects. It can be classified into the following six categories:
* Factory model: Annotation is performed by a completely automated pipeline.
* Museum model: Manual curation by experts is involved to interpret the results of an annotation project.
* Cottage industry model: Annotation is decentralized and is the result of the effort of different part-time curators.
* Party or jamboree model: Consists of a short intensive workshop with leading curators from the community. It was first used in the ''Drosophila melanogaster'' genome annotation project.
* Blessed annotator: A variation of the museum model, applied in the Knockout Mouse Project (KOMP), in which curators go through a training period prior to annotation and are then given access to annotation tools to continue their work.
* Gatekeeper approach: It is a combination of the jamboree and cottage industry models. It begins with an annotation workshop, followed by a decentralized collaboration to extend and refine the initial annotation. It has been used for multiple species data.
A community annotation is said to be ''supervised'' when there is a coordinator who manages the project by requesting the annotation of specific items from a select number of experts. On the other hand, when anyone can enter a project and coordination is accomplished in a decentralized manner, it is called ''unsupervised'' community annotation. Supervised community annotation is short-lived and limited to the duration of the event, whereas the unsupervised counterpart does not have this limitation. However, the latter has been less successful than the former, presumably due to a lack of time, motivation, incentive and/or communication.
Wikipedia has multiple WikiProjects aimed at improving annotation. The Gene WikiProject, for instance, operates a bot that harvests gene data from research databases and creates gene stubs on that basis. The RNA WikiProject seeks to write articles that describe individual RNAs and RNA families in an accessible way.
Applications
Disease diagnosis
Gene Ontology is being used by researchers to establish a disease-gene relationship, as GO helps in the identification of novel genes, the alterations in their expression, distribution and function under a different set of conditions, such as diseased versus healthy.
Databases of these disease-gene relationships in different organisms have been created, such as Plant-Pathogen Ontology, Plant-Associated Microbe Gene Ontology, and DisGeNET. Others have been implemented in pre-existing databases, such as the Rat Disease Ontology in the Rat Genome Database.
Bioremediation
A great diversity of catabolic enzymes involved in hydrocarbon degradation by some bacterial strains are encoded by genes located in their mobile genetic elements (MGEs). The study of these elements is of great importance in the field of bioremediation, since the inoculation of wild or genetically modified strains with these MGEs has recently been sought in order to acquire these hydrocarbon degradation capacities.
In 2013, Phale et al. published the genome annotation of a strain of ''Pseudomonas putida'' (CSV86), a bacterium known for its preference for naphthalene and other aromatic compounds over glucose as a carbon and energy source.
In order to find the MGEs of this bacterium, its genome was annotated using RAST and the NCBI Prokaryotic Genome Annotation Pipeline (PGAP), and nine mobile elements were identified with the Insertion Sequence (IS) Finder database. This analysis located the upper pathway genes of naphthalene degradation right next to the genes encoding tRNA-Gly and integrase. It also identified the genes encoding enzymes involved in the degradation of salicylate, benzoate, 4-hydroxybenzoate, phenylacetic acid, and hydroxyphenyl acetic acid, and recognized an operon involved in glucose transport in the strain.
Gene Ontology analysis is of great importance in functional annotation. In bioremediation specifically, it can be applied to relate the genes of some microorganisms to their functions and to their role in the remediation of certain contaminants. This was the approach taken in the investigation and identification of ''Halomonas zincidurans'' strain B6(T), a bacterium with thirty-one genes encoding resistance to heavy metals, especially zinc, and ''Stenotrophomonas sp.'' DDT-1, a strain capable of using DDT as its sole carbon and energy source, to mention a few examples.
Software
Genes in a eukaryotic genome can be annotated using various annotation tools such as FINDER.
A modern annotation pipeline, such as MOSGA, can support a user-friendly web interface and software containerization.
Modern annotation pipelines for prokaryotic genomes include Bakta, Prokka and PGAP.
The National Center for Biomedical Ontology develops tools for automated annotation of database records based on the textual descriptions of those records.
As a general method, dcGO has an automated procedure for statistically inferring associations between ontology terms and protein domains or combinations of domains from the existing gene/protein-level annotations.
A variety of software tools have been developed that allow scientists to view and share genome annotations, such as MAKER.
Genome annotation is an active area of investigation and involves a number of different organizations in the life science community which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means. Here is an alphabetical listing of on-going projects relevant to genome annotation:
* Encyclopedia of DNA elements (ENCODE)
* Ensembl
* Entrez Gene
* FlyBase
* GENCODE
* Gene Ontology Consortium
* GeneRIF
* Mouse Genome Informatics
* RefSeq
* UniProt
* Vertebrate and Genome Annotation Project (Vega)
* WormBase