DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. An annotation (irrespective of the context) is a note added by way of explanation or commentary. Once a genome is sequenced, it needs to be annotated to make sense of it. Genes in a eukaryotic genome can be annotated using various annotation tools such as FINDER.
A modern annotation pipeline, such as MOSGA, can offer a user-friendly web interface and software containerization.
For DNA annotation, a previously unknown sequence representation of genetic material is enriched with information relating genomic position to intron-exon boundaries, regulatory sequences, repeats, gene names and protein products. This annotation is stored in genomic databases such as Mouse Genome Informatics, FlyBase, and WormBase. Educational materials on some aspects of biological annotation from the 2006 Gene Ontology annotation camp and similar events are available at the Gene Ontology website.
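As an illustration of how genomic positions relate to intron-exon boundaries, the following minimal sketch derives intron intervals from the exon coordinates of a single transcript. The `Interval` class, the 1-based inclusive coordinate convention, and the example coordinates are assumptions made for this sketch; they are not taken from any particular database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A 1-based, inclusive genomic interval (an assumed convention)."""
    start: int
    end: int


def introns_from_exons(exons):
    """Return the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons, key=lambda e: e.start)
    introns = []
    for left, right in zip(ordered, ordered[1:]):
        if right.start <= left.end:
            raise ValueError("exons overlap")
        # Directly adjacent exons leave no gap, so no intron is recorded.
        if right.start - left.end > 1:
            introns.append(Interval(left.end + 1, right.start - 1))
    return introns


# Hypothetical three-exon gene used only for illustration.
exons = [Interval(100, 200), Interval(301, 450), Interval(600, 720)]
introns = introns_from_exons(exons)
```

Real annotation records usually also carry strand, chromosome, and feature identifiers, which are omitted here for brevity.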
The National Center for Biomedical Ontology (www.bioontology.org) develops tools for automated annotation (http://bioontology.stanford.edu/annotator-service) of database records based on the textual descriptions of those records.
As a general method, dcGO has an automated procedure for statistically inferring associations between ontology terms and protein domains or combinations of domains from the existing gene/protein-level annotations.
Process
Genome annotation consists of three main steps:
# identifying portions of the genome that do not code for proteins
# identifying elements on the genome, a process called gene prediction
# attaching biological information to these elements
Automatic annotation tools attempt to perform these steps via computer analysis, as opposed to manual annotation (a.k.a. curation), which involves human expertise. Ideally, these approaches co-exist and complement each other in the same annotation pipeline.
A simple method of gene annotation relies on homology-based search tools, like BLAST, to search for homologous genes in specific databases; the resulting information is then used to annotate genes and genomes.
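The homology-based approach can be sketched as follows. This is a minimal illustration, assuming the search results are available as BLAST+ tabular text with the 12 default columns; the E-value cutoff, the gene and subject identifiers, and the description table are hypothetical values chosen for the example.

```python
import csv
import io

# Assumed column order: the 12 default fields of BLAST+ tabular output.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-5):
    """Keep, per query, the hit with the highest bit score under an E-value cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        if float(hit["evalue"]) > max_evalue:
            continue
        bitscore = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def transfer_annotations(hits, descriptions):
    """Copy the description of each best-hit subject onto the query gene."""
    return {q: descriptions.get(s, "uncharacterized") for q, s in hits.items()}


# Hypothetical search results and subject descriptions, for illustration only.
example = "\n".join([
    "\t".join(["geneA", "P1", "98.5", "200", "3", "0", "1", "200", "1", "200", "1e-100", "380"]),
    "\t".join(["geneA", "P2", "60.0", "180", "72", "2", "5", "184", "10", "189", "2e-40", "150"]),
    "\t".join(["geneB", "P3", "30.0", "50", "35", "1", "1", "50", "1", "50", "0.5", "25"]),
])
descriptions = {"P1": "hypothetical kinase", "P2": "hypothetical transferase"}
annotations = transfer_annotations(best_hits(example), descriptions)
```

In this example `geneB` receives no annotation because its only hit fails the E-value cutoff. Best-hit transfer is a deliberately simple rule; it can propagate errors from the reference database, which is one reason manual curation remains useful.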
However, as information is added to the annotation platform, manual annotators become capable of deconvoluting discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases (e.g. Ensembl) rely on curated data sources as well as a range of different software tools in their automated genome annotation pipeline.
''Structural annotation'' consists of the identification of genomic elements.
* ORFs and their localization
* gene structure
* coding regions
* location of regulatory motifs
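As a concrete example of one structural annotation task, the sketch below locates open reading frames (ORFs) on the forward strand of a DNA sequence. It is a simplified illustration, not a gene predictor: it assumes ATG as the only start codon, uses the three standard stop codons, ignores the reverse strand and introns, and the `min_codons` threshold is an arbitrary choice for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive, and include the stop codon.
    min_codons counts every codon in the ORF, including start and stop.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence containing one short ORF (ATG GCT AAA GGT TAA).
orfs = find_orfs("CCATGGCTAAAGGTTAACC")
```

A fuller implementation would also scan the reverse complement and would typically apply a much larger minimum length.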
''Functional annotation'' consists of attaching biological information to genomic elements.
* biochemical function
* biological function
* involved regulation and interactions
* expression
These steps may involve both biological experiments and ''in silico'' analysis.
Proteogenomics-based approaches utilize information from expressed proteins, often derived from mass spectrometry, to improve genomics annotations.
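One simple way to picture the proteogenomic idea is to map an observed peptide back onto translated genomic sequence. The sketch below is an assumption-laden illustration: it uses the standard genetic code, searches only the three forward reading frames, requires an exact match, and the function names and example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, frame):
    """Translate one forward reading frame; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def map_peptide(dna, peptide):
    """Return (start, end, frame) genomic spans, 0-based and end-exclusive."""
    dna = dna.upper()
    hits = []
    for frame in range(3):
        protein = translate(dna, frame)
        index = protein.find(peptide)
        while index != -1:
            start = frame + 3 * index
            hits.append((start, start + 3 * len(peptide), frame))
            index = protein.find(peptide, index + 1)
    return hits


# Hypothetical genomic sequence and peptide, for illustration only.
peptide_hits = map_peptide("CCATGGCTAAAGGTTAACC", "MAKG")
```

A peptide that maps to a region with no annotated coding sequence is the kind of evidence that can prompt a revision of the annotation. Real workflows also handle the reverse strand, splice junctions, and inexact matches.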
A variety of software tools have been developed to permit scientists to view and share genome annotations; for example, MAKER.
Genome annotation remains a major challenge for scientists investigating the human genome, now that the genome sequences of more than a thousand human individuals (The 100,000 Genomes Project, UK) and several model organisms are largely complete.
Identifying the locations of genes and other genetic control elements is often described as defining the biological "parts list" for the assembly and normal operation of an organism.
Scientists are still at an early stage in the process of delineating this parts list and in understanding how all the parts "fit together".
Genome annotation is an active area of investigation and involves a number of different organizations in the life science community, which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means. Here is an alphabetical listing of ongoing projects relevant to genome annotation:
* Encyclopedia of DNA elements (ENCODE)
* Entrez Gene
* Ensembl
* GENCODE
* Gene Ontology Consortium
* GeneRIF
* RefSeq
* UniProt
* Vertebrate and Genome Annotation Project (Vega)
At Wikipedia, genome annotation has started to become automated under the auspices of the Gene Wiki portal, which operates a bot that harvests gene data from research databases and creates gene stubs on that basis.