Porphyra umbilicalis chloroplast genome visualized with Chloroplot

In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do. Annotation is performed after a genome is sequenced and assembled, and is a necessary step in genome analysis before the sequence is deposited in a database and described in a published article. Although describing individual genes and their products or functions is sufficient for a description to be considered an annotation, the depth of analysis reported in the literature for different genomes varies widely, with some reports including additional information that goes beyond a simple annotation. Furthermore, due to the size and complexity of sequenced genomes, DNA annotation is not performed manually, but is instead automated by computational means. However, the conclusions drawn from the obtained results require manual expert analysis.

DNA annotation is classified into two categories: ''structural annotation'', which identifies and demarcates elements in a genome, and ''functional annotation'', which assigns functions to these elements. This is not the only way in which it has been categorized, as several alternatives, such as dimension-based and level-based classifications, have also been proposed.

History

The first generation of genome annotators used local ''ab initio'' methods, which are based solely on the information that can be extracted from the DNA sequence on a local scale, that is, one open reading frame (ORF) at a time. They appeared as a necessity to handle the enormous amount of data produced by the Maxam-Gilbert and Sanger DNA sequencing techniques developed in the late 1970s. The first software used to analyze sequencing reads was the Staden Package, created by Rodger Staden in 1977. It performed several tasks related to annotation, such as base and codon counts. In fact, codon usage was the main strategy used by several early protein coding sequence (CDS) prediction methods, based on the assumption that the most translated regions in a genome contain codons with the most abundant corresponding tRNAs (the molecules responsible for carrying amino acids to the ribosome during protein synthesis), allowing a more efficient translation. This was also known to be the case for synonymous codons, which are often present in proteins expressed at a lower level. The advent of complete genomes in the 1990s (the first one being the genome of ''Haemophilus influenzae'', sequenced in 1995) introduced a second generation of annotators. Just like in the previous generation, they performed annotation through ''ab initio'' methods, but now applied on a genome-wide scale.
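The base and codon counts mentioned above are simple to illustrate. The following is a minimal Python sketch, not a reconstruction of the Staden Package: the function names and the toy sequence are assumptions made for illustration only.

```python
from collections import Counter


def base_and_codon_counts(cds: str):
    """Count single bases and in-frame codons of a coding sequence."""
    seq = cds.upper()
    bases = Counter(seq)
    # Walk the sequence three bases at a time; a trailing partial codon is ignored.
    usable = len(seq) - len(seq) % 3
    codons = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    return bases, codons


def codon_frequencies(codons: Counter) -> dict:
    """Convert raw codon counts into relative frequencies (codon usage)."""
    total = sum(codons.values())
    return {codon: n / total for codon, n in codons.items()}


# Toy sequence (hypothetical): ATG GCT GCT GCA TAA
bases, codons = base_and_codon_counts("ATGGCTGCTGCATAA")
usage = codon_frequencies(codons)
print(codons["GCT"], usage["GCT"])
```

A codon-usage-based predictor would compare such frequencies in a candidate region with those of highly expressed genes; that comparison is not shown here.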

Markov models are the driving force behind many algorithms used within annotators of this generation; these models can be thought of as directed graphs where nodes represent different genomic signals (such as transcription and translation start sites) connected by arrows representing the scanning of the sequence. For a Markov model to detect a genomic signal, it must first be trained on a series of known genomic signals. The output of Markov models in the context of annotation includes the probabilities of every kind of genomic element in every single part of the genome, and an accurate Markov model will assign high probabilities to correct annotations and low probabilities to the incorrect ones.

Genome Annotation Timeline
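The train-then-score idea behind Markov models can be sketched with a first-order Markov chain over nucleotides. This is a deliberately simplified illustration, not the hidden Markov models used by real annotators; the training sequences, the pseudocount, and the function names are all assumptions.

```python
import math

BASES = "ACGT"


def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | current base)."""
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
        for a in BASES
    }


def log_odds(seq, signal_model, background_model):
    """Sum of log-likelihood ratios over transitions; positive favours the signal model."""
    return sum(
        math.log(signal_model[c][n] / background_model[c][n])
        for c, n in zip(seq, seq[1:])
    )


# Toy training sets (hypothetical): a GC-alternating "signal" and an AT-alternating background.
signal = train_markov(["GCGCGCGC", "CGCGCGCG"])
background = train_markov(["ATATATAT", "TATATATA"])

print(log_odds("GCGCGC", signal, background))  # positive: resembles the signal
print(log_odds("ATATAT", signal, background))  # negative: resembles the background
```

The sketch only accepts the letters A, C, G, and T; handling ambiguous bases would need an extra step.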

As more sequenced genomes became available in the early and mid-2000s, coupled with the numerous protein sequences that were obtained experimentally, genome annotators began employing homology-based methods, launching the third generation of genome annotation. These new methods allowed annotators not only to infer genomic elements through statistical means (as in previous generations) but also to perform their task by comparing the sequence being annotated with other already existing and validated sequences. These so-called combiner annotators, which perform both ''ab initio'' and homology-based annotation, require fast alignment algorithms to identify regions of homology.

In the late 2000s, genome annotation shifted its attention towards identifying non-coding regions in DNA, which was achieved thanks to the appearance of methods to analyze transcription factor binding sites, DNA methylation sites, chromatin structure, and other RNA and regulatory region analysis techniques. Other genome annotators also began to focus on population-level studies represented by the pangenome; by doing so, for instance, annotation pipelines ensure that core genes of a clade are also found in new genomes of the same clade. Both annotation strategies constitute the fourth generation of genome annotators.

By the 2010s, the genome sequences of more than a thousand human individuals (through the 1000 Genomes Project) and several model organisms became available. As such, genome annotation remains a major challenge for scientists investigating the human and other genomes.

Structural annotation

Structural annotation describes the precise location of the different elements in a genome, such as open reading frames (ORFs), coding sequences (CDS), exons, introns, repeats, splice sites, regulatory motifs, start and stop codons, and promoters. The main steps of structural annotation are:

# Repeat identification and masking.
# Evidence alignment (optional).
# Splice identification (only in eukaryotes).
# Feature prediction (coding and noncoding sequences).

Repeat identification and masking

The first step of structural annotation consists in the identification and masking of repeats, which include low-complexity sequences (such as AGAGAGAG, or monopolymeric segments like TTTTTTTTT), and transposons (which are larger elements with several copies across the genome). Repeats are a major component of both prokaryotic and eukaryotic genomes; for instance, between 0% and over 42% of prokaryotic genomes consist of repeats and three quarters of the human genome is composed of repetitive elements.

Identifying repeats is difficult for two main reasons: they are poorly conserved, and their boundaries are not clearly defined. Because of this, repeat libraries must be built for the genome of interest, which can be accomplished with one of the following methods:

* ''De novo'' methods. Repeats are identified by detecting and grouping pairs of sequences at different locations whose similarity is above a minimum threshold of sequence conservation in a self-genome comparison, thus requiring no prior information about repeat structure or sequences. The disadvantage of these methods is that they can identify any repeated sequence, not just transposons, and may include conserved coding sequences (CDS), making careful post-processing an indispensable step to remove these sequences. They may also leave out related regions that have degraded over time and may group elements that have no connection in their evolutionary history.
* Homology-based methods. Repeats are identified by similarity (homology) to known repeats stored in a curated database. These methods are more likely to find real transposons, even in lower quantities, when compared with ''de novo'' methods, but are biased towards previously identified families.
* Structure-based methods. Repeats are identified based on models of their structure, rather than repetition or similarity. They are capable of identifying real transposons (just like the homology-based ones), but are not biased by known elements. However, they are highly specific to each class of repeat, and, as such, are less universally applicable.
* Comparative genomic methods. Repeats are identified as disruptions of one or more sequences in a multiple sequence alignment produced by large insertion regions. Although this strategy avoids the poorly defined boundary problem that exists in other methods, it is highly dependent on assembly quality and the level of activity of transposons in the genomes in question.

After the repetitive regions in a genome have been identified, they are masked. ''Masking'' means replacing the letters of the nucleotides (A, C, G, or T) with other letters. By doing so, these regions will be marked as repetitive and downstream analyses will treat them accordingly. Repetitive regions may produce performance issues if they are not masked, and may even produce false evidence for gene annotation (for example, treating an open reading frame (ORF) in a transposon as an exon). Depending on the letters used for replacement, masking can be classified as soft or hard: in ''soft masking'', repetitive regions are indicated with lowercase letters (a, c, g, or t), whereas in ''hard masking'', the letters of these regions are replaced with N's. This way, for example, soft masking can be used to exclude word matches and avoid initiating an alignment in those regions, and hard masking, apart from all of this, can also exclude masked regions from alignment scores.
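The difference between soft and hard masking described above can be shown in a few lines of Python. This is a minimal sketch: the function name, the half-open interval convention, and the toy sequence are assumptions, and the repeat coordinates are taken as given rather than detected.

```python
def mask(seq: str, intervals, mode: str = "soft") -> str:
    """Mask half-open [start, end) intervals of a nucleotide sequence.

    soft: masked bases become lowercase (the original letters are kept).
    hard: masked bases are replaced with 'N' (the original letters are lost).
    """
    if mode not in ("soft", "hard"):
        raise ValueError("mode must be 'soft' or 'hard'")
    chars = list(seq.upper())
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower() if mode == "soft" else "N"
    return "".join(chars)


# Toy example (hypothetical): the low-complexity run AGAGAGAG sits at positions 4 to 11.
genome = "ACGTAGAGAGAGTTCA"
repeats = [(4, 12)]

print(mask(genome, repeats, "soft"))  # ACGTagagagagTTCA
print(mask(genome, repeats, "hard"))  # ACGTNNNNNNNNTTCA
```

Soft masking is reversible with a simple uppercase conversion, whereas hard masking discards the underlying sequence.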

Evidence alignment

The next step after genome masking usually involves aligning all available transcript and protein evidence with the analyzed genome, that is, aligning all known expressed sequence tags (ESTs), RNAs and proteins of the organism being annotated with the genome. Although it is optional, it can improve gene sequence elucidation because RNAs and proteins are direct products of coding sequences. If RNA-Seq data is available, it may be used to annotate and quantify all of the genes and their isoforms located in the corresponding genome, providing not only their locations, but also their rates of expression. However, transcripts provide insufficient information for gene prediction because they might be unobtainable from some genes, they may encode operons of more than one gene, and their start and stop codons cannot be determined due to frameshifts and translation initiation factors. To solve this problem, proteogenomics-based approaches are employed, which utilize information from expressed proteins often derived from mass spectrometry.

Splice identification

Annotation of eukaryotic genomes has an extra layer of difficulty due to RNA splicing, a post-transcriptional process in which introns (non-coding regions) are removed and exons (coding regions) are joined. Therefore, eukaryotic coding sequences (CDS) are discontinuous, and, to ensure their proper identification, intronic regions must be filtered. To do so, annotation pipelines must find the exon-intron boundaries, and multiple methodologies have been developed for this purpose. One solution is to use known exon boundaries for alignment; for instance, many introns begin with GT and end with AG. This approach, however, cannot detect novel boundaries, so alternatives exist, such as machine learning algorithms that are trained on known exon boundaries and quality information to predict new ones. Predictors of new exon boundaries usually require efficient data-compression and alignment algorithms, but they are prone to failure in boundaries located in regions with low sequence coverage or high error rates produced during sequencing.
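The GT...AG rule mentioned above can be sketched as a naive scan for candidate introns. This is only an illustration of the rule, not a splice-site predictor: the function names, the length limits, and the toy sequence are assumptions, and real pipelines combine such signals with alignment evidence or trained models.

```python
def candidate_introns(seq: str, min_len: int = 4, max_len: int = 1000):
    """Return half-open (start, end) spans that begin with GT and end with AG.

    The default length limits are toy values chosen so the short example works;
    real introns are considerably longer.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    spans = []
    for d in donors:
        for a in acceptors:
            end = a + 2  # include the AG dinucleotide in the intron
            if min_len <= end - d <= max_len:
                spans.append((d, end))
    return spans


def splice_out(seq: str, span) -> str:
    """Remove one candidate intron and join the flanking exon sequence."""
    start, end = span
    return seq[:start] + seq[end:]


# Toy pre-mRNA-like sequence (hypothetical): exon ATG, intron GTAAACAG, exon CC
toy = "ATGGTAAACAGCC"
spans = candidate_introns(toy)
print(spans)                      # [(3, 11)]
print(splice_out(toy, spans[0]))  # ATGCC
```

Because GT and AG are very common dinucleotides, a scan like this produces many false candidates on real sequences, which is one reason trained predictors are used instead.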

Feature prediction

A genome is divided into coding and noncoding regions, and the last step of structural annotation consists in identifying these features within the genome. In fact, the primary task in genome annotation is gene prediction, which is why numerous methods have been developed for this purpose. Gene prediction is a misleading term, as most gene predictors only identify coding sequences (CDS) and do not report untranslated regions (UTRs); for this reason, CDS prediction has been proposed as a more accurate term. CDS predictors detect genome features through methods called ''sensors'', which include ''signal sensors'' that identify functional site signals such as promoters and polyA sites, and ''content sensors'' that classify DNA sequences into coding and noncoding content. Whereas prokaryotic CDS predictors mostly deal with open reading frames (ORFs), which are segments of DNA between the start and stop codons, eukaryotic CDS predictors are faced with a more difficult problem because of the complex organization of eukaryotic genes. CDS prediction methods can be classified into three broad categories:

* ''Ab initio'' methods (also called statistical, intrinsic, or ''de novo''). CDS prediction is based solely on the information that can be extracted from the DNA sequence. They rely on statistical methods such as the hidden Markov model (HMM). Some methods employ two or more genomes to infer local mutation rates and patterns along the genome.
* Homology-based methods (also called empirical, evidence-driven, or extrinsic). CDS prediction is based on similarity to known sequences. Specifically, it performs alignments of the analyzed sequence with expressed sequence tags (ESTs), complementary DNA (cDNA), or protein sequences.
* Combiners. CDS prediction is done by a combination of both methods mentioned above.
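The notion of an ORF as the segment between a start and a stop codon can be illustrated with a short scan. This is a minimal sketch that assumes ATG as the only start codon and searches the three forward reading frames only; scanning the reverse complement for the remaining three frames, and the `min_codons` threshold, are left as assumptions of the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2):
    """Find ORFs on the forward strand.

    Returns (start, end, frame) tuples with half-open coordinates; each ORF
    runs from an ATG to the first in-frame stop codon, stop included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first ATG seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and look for the next ATG
    return orfs


# Toy sequence (hypothetical): AA + ATG GCT TAA + CC
toy_seq = "AAATGGCTTAACC"
orfs = find_orfs(toy_seq)
print(orfs)                             # [(2, 11, 2)]
print(toy_seq[orfs[0][0]:orfs[0][1]])   # ATGGCTTAA
```

A scan like this corresponds to a very crude signal sensor; it says nothing about whether the ORF is actually coding, which is what content sensors and statistical models address.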

Functional annotation

Functional annotation assigns functions to the genomic elements found by structural annotation, by relating them to biological processes such as the cell cycle, cell death, development, metabolism, etc. It may also be used as an additional quality check by identifying elements that may have been annotated by error.

Coding sequence function prediction

Functional annotation of genes requires a controlled vocabulary (or ontology) to name the predicted functional features. However, because there are numerous ways to define gene functions, the annotation process may be hindered when it is performed by different research groups. As such, a standardized controlled vocabulary must be employed, the most comprehensive of which is the Gene Ontology (GO). It classifies functional properties into one of three categories (molecular function, biological process, and cellular component) and organizes them in a directed acyclic graph, in which every node is a particular function, and every edge (or arrow) between two nodes indicates a parent-child or subcategory-category relationship. As of 2020, GO is the most widely used controlled vocabulary for functional annotation of genes, followed by the MIPS Functional Catalog (FunCat).

Some conventional methods for functional annotation are homology-based and rely on local alignment search tools. Their premise is that high sequence conservation between two genomic elements implies that their function is conserved as well. Pairs of homologous sequences that appeared through paralogy, orthology, or xenology usually perform a similar function. However, orthologous sequences should be treated with caution for two reasons: (1) they might have different names depending on when they were originally annotated, and (2) they may not perform the same functional role in two different organisms. Annotators often refer to an analogous sequence when no paralogy, orthology or xenology was found. Homology-based methods have several drawbacks, such as errors in the database, low sensitivity/specificity, inability to distinguish between paralogy and homology, artificially high scores due to the presence of low complexity regions, and significant variation within a protein family.

Functional annotation can also be performed through probabilistic methods. The distribution of hydrophilic and hydrophobic amino acids indicates whether a protein is located in a solution or membrane. Specific sequence motifs provide information on posttranslational modifications and final location of any given protein. Probabilistic methods may be paired with a controlled vocabulary, such as GO; for example, protein-protein interaction (PPI) networks usually place proteins with similar functions close to each other.
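The directed acyclic graph structure of a controlled vocabulary such as GO can be sketched with a small parent map. The term names below are hypothetical placeholders, not real GO identifiers, and the propagation of an annotation to all ancestor terms is shown as one common convention rather than as a description of any specific tool.

```python
def ancestors(term: str, parents: dict) -> set:
    """Collect every term reachable by following child-to-parent edges."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations: set, parents: dict) -> set:
    """Expand a set of annotated terms with all of their ancestors."""
    expanded = set(annotations)
    for term in annotations:
        expanded |= ancestors(term, parents)
    return expanded


# Toy ontology (hypothetical): term_C has two parents, which a tree could not express.
toy_parents = {
    "term_A": ["root"],
    "term_B": ["root"],
    "term_C": ["term_A", "term_B"],
}

print(sorted(ancestors("term_C", toy_parents)))    # ['root', 'term_A', 'term_B']
print(sorted(propagate({"term_C"}, toy_parents)))  # ['root', 'term_A', 'term_B', 'term_C']
```

The `seen` set ensures that shared ancestors, such as `root` here, are visited only once even when several paths lead to them.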

Machine learning methods are also used to generate functional annotations for novel proteins based on GO terms. Generally, they consist of constructing a binary classifier for each GO term; these classifiers are then combined to make predictions on individual GO terms (forming a multiclass classifier), for which confidence scores are later obtained. The support vector machine (SVM) is the most widely used binary classifier in functional annotation; however, other algorithms, such as k-nearest neighbors (kNN) and convolutional neural networks (CNN), have also been employed. Binary or multiclass classification methods for functional annotation generally produce less accurate results because they do not take into account the interrelations between GO terms. More advanced methods that consider these interrelations follow either a flat or a hierarchical approach: the former does not take into account the ontology structure, while the latter does. Some of these methods compress the GO terms by matrix factorization or by hashing, thus boosting their performance.
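The "one binary classifier per GO term" scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses kNN rather than an SVM to stay dependency-free, the two-dimensional feature vectors and protein names are hypothetical, and the function names (`knn_confidence`, `annotate`) are made up for this example.

```python
import math

def knn_confidence(train, query, k=3):
    """Binary kNN for one GO term: fraction of the k nearest proteins carrying the term."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(1 for _, has_term in nearest if has_term) / len(nearest)

def annotate(features, terms, query, k=3, threshold=0.5):
    """Run one binary classifier per GO term and keep terms scoring at or above threshold."""
    all_terms = sorted({t for assigned in terms.values() for t in assigned})
    scores = {}
    for term in all_terms:
        # One-vs-rest training set: the label is True if the protein has this term.
        train = [(features[p], term in terms[p]) for p in features]
        scores[term] = knn_confidence(train, query, k)
    return {t: s for t, s in scores.items() if s >= threshold}

# Hypothetical feature vectors (e.g. derived from sequence or PPI data).
features = {
    "P1": (0.90, 0.10), "P2": (0.80, 0.20), "P5": (0.85, 0.15),
    "P3": (0.10, 0.90), "P4": (0.20, 0.80),
}
# GO:0003677 = DNA binding, GO:0016301 = kinase activity.
terms = {
    "P1": {"GO:0003677"}, "P2": {"GO:0003677"}, "P5": {"GO:0003677"},
    "P3": {"GO:0016301"}, "P4": {"GO:0016301"},
}

prediction = annotate(features, terms, query=(0.88, 0.12))
print(prediction)
```

Because each term is scored independently, this sketch has the limitation noted above: it ignores the relationships between GO terms.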

Noncoding sequence function prediction

Noncoding sequences (ncDNA) are those that do not code for proteins. They include elements such as pseudogenes, segmental duplications, binding sites and RNA genes.

Pseudogenes are mutated copies of protein-coding genes that lost their coding function due to a disruption in their open reading frame (ORF), making them untranslatable. They may be identified using one of the following two methods:
* Homology-based method. Pseudogenes are identified by searching for sequences that are similar to functional genes but contain mutations that disrupt their ORF. This method cannot determine the evolutionary relationship between a pseudogene and its parent gene, nor the time elapsed since the event happened.
* Phylogeny-based method. Pseudogenes are identified by means of a phylogenetic analysis. First, a species tree of the species of interest and a phylogenetic tree of the gene (or gene family) of interest are constructed. The two are then compared to identify a species that has lost the gene. Next, within the genome of the species where the gene was not found, a sequence is searched for that is orthologous to the gene identified in the closest species. Finally, if this orthologous sequence has a disruption in its ORF (and it meets other criteria, such as data analysis, dN/dS ratio, etc.), the sequence is considered a pseudogene.
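Both methods ultimately test whether a candidate sequence has a disrupted ORF. A minimal sketch of such a check follows, assuming the standard genetic code and a candidate that has already been extracted in the reading frame of its parent gene; the function name `orf_disruptions` and the example sequences are illustrative only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)

def orf_disruptions(cds):
    """Return reasons why `cds` cannot be read as a single intact ORF."""
    cds = cds.upper()
    problems = []
    if not cds.startswith("ATG"):
        problems.append("missing start codon")
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of three (possible frameshift)")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # Report only the first stop codon found before the final codon.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {index + 1}")
            break
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    return problems

intact = orf_disruptions("ATGGCCAAATAA")        # ATG GCC AAA TAA
disrupted = orf_disruptions("ATGGCCTAAAAATAA")  # ATG GCC TAA AAA TAA
print(intact, disrupted)
```

An empty list means no disruption was detected by these simple rules; it does not by itself show that the sequence is a functional gene.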

Segmental duplications are DNA segments of more than 1000 base pairs that are repeated in the genome with more than 90% sequence identity. Two strategies used for their identification are WGAC and WSSD:
* Whole-Genome Assembly Comparison (WGAC). It aligns the entire genome to itself in order to identify repeated sequences after filtering out common repeats; it does not require having the original reads used for the assembly.
* Whole-genome Shotgun Sequence Detection (WSSD). It aligns the original reads with the assembled genome and searches for regions with a higher read depth than the average, which usually are signals of duplication. Segmental duplications identified by this method but not by WGAC are likely collapsed duplications, which means that they were mistakenly aligned to the same region.
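The read-depth idea behind WSSD can be illustrated with a toy example. The sketch below assumes that per-window read depths have already been computed from an alignment; the median baseline, the 1.5x cutoff and the name `excess_depth_regions` are arbitrary choices for illustration, not parameters of any published tool.

```python
from statistics import median

def excess_depth_regions(window_depths, window_size=1000, ratio=1.5):
    """Merge consecutive windows whose depth exceeds ratio * median depth.

    Returns 0-based, half-open (start, end) coordinates.
    """
    baseline = median(window_depths)
    regions, start = [], None
    for i, depth in enumerate(window_depths):
        elevated = depth > ratio * baseline
        if elevated and start is None:
            start = i                      # open a new candidate region
        elif not elevated and start is not None:
            regions.append((start * window_size, i * window_size))
            start = None                   # close the current region
    if start is not None:                  # region runs to the end of the sequence
        regions.append((start * window_size, len(window_depths) * window_size))
    return regions

# Hypothetical mean read depth in eight consecutive 1 kb windows.
depths = [30, 31, 29, 62, 60, 30, 28, 30]
regions = excess_depth_regions(depths)
print(regions)
```

Here the two windows with roughly double the baseline depth are reported as one candidate region, which is the pattern expected when two copies of a sequence have been collapsed into one in the assembly.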

DNA binding sites are regions in the genome sequence that bind to and interact with specific proteins. They play an important role in DNA replication and repair, transcriptional regulation, and viral infection. Binding site prediction involves the use of one of the following two methods:
* Sequence similarity based methods. They consist of identifying homologous sequences with known DNA binding sites, or aligning them with query proteins. Their performance is usually low because the DNA binding sequences are less conserved.
* Structure based methods. They employ the three-dimensional structural information of proteins to predict the locations of DNA binding sites.

Noncoding RNA (ncRNA), produced by RNA genes, is a type of RNA that is not translated into a protein. It includes molecules such as rRNA, snoRNA, and microRNA, as well as noncoding mRNA-like transcripts. ''Ab initio'' prediction of RNA genes in a single genome often yields inaccurate results (with an exception being miRNA), so multi-genome comparative methods are used instead. These methods are specifically concerned with the secondary structures of ncRNA, as they are conserved in related species even when their sequence is not. Therefore, by performing a multiple sequence alignment, more useful information can be obtained for their prediction. Homology search may also be employed to identify RNA genes, but this procedure is complicated, especially in eukaryotes, due to the presence of a large number of repeats and pseudogenes.
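A structure that is conserved while the sequence changes shows up in an alignment as compensatory changes: two columns differ between species, but always in a way that keeps them able to base-pair. The following sketch tests a pair of alignment columns for that pattern. It is a deliberately simplified illustration (no gaps, no statistics), and the function name and the toy hairpin alignment are assumptions made for this example.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def covarying_pair(alignment, i, j):
    """True if columns i and j can pair in every sequence while the bases themselves vary."""
    observed = {(seq[i], seq[j]) for seq in alignment}
    return observed <= PAIRS and len(observed) > 1

# Toy alignment of a hairpin from three hypothetical species:
# columns 0-8 and 1-7 form the stem, columns 2-6 form the loop.
alignment = [
    "GCGAAAAGC",
    "AUGAAAAAU",
    "CGGAAAACG",
]

stem_outer = covarying_pair(alignment, 0, 8)
stem_inner = covarying_pair(alignment, 1, 7)
loop = covarying_pair(alignment, 2, 6)
print(stem_outer, stem_inner, loop)
```

The stem columns differ in sequence across the three species yet remain complementary, which is the kind of signal comparative ncRNA prediction relies on.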

Visualization

File formats

Visualization of annotations in a genome browser requires a descriptive output file, which should describe the intron-exon structures of each annotation, their start and stop codons, UTRs and alternative transcripts, and ideally should include information about the sequence alignments and gene predictions that support each gene model. Some commonly used formats for describing annotations are GenBank, GFF3, GTF, BED and EMBL. Some of these formats use controlled vocabularies and ontologies to define their descriptive terminologies and guarantee interoperability between analysis and visualization tools.
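As an illustration of what such a file contains, a GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with 1-based inclusive coordinates and `key=value` attributes separated by semicolons. The sketch below parses one line; the feature shown is invented for the example, and multi-valued attributes are kept as a single string for brevity.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])   # 1-based, inclusive
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)  # GFF3 uses URL-style escapes
    record["attributes"] = attributes
    return record

# Hypothetical feature line; IDs and names are made up.
line = "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=hypothetical_gene"
record = parse_gff3_line(line)
print(record["type"], record["start"], record["end"], record["attributes"])
```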

Genome browsers

Genomic browsers are software products that simplify the analysis and visualization of large genomic sequence and annotation data to gain biological insight, via a graphical interface. Genomic browsers can be divided into web-based genomic browsers and stand-alone genomic browsers. The former use information from databases and can be classified into ''multiple-species'' (integrate sequence and annotations of multiple organisms and promote cross-species comparative analysis) and ''species-specific'' (focus on one organism and the annotations for particular species). The latter are not necessarily linked to a specific genome database but are general-purpose browsers that can be downloaded and installed as an application on a local computer.

Comparative visualization of genomes

Comparative genomics aims to identify similarities and differences in genomic features, as well as to examine evolutionary relationships between organisms. Visualization tools capable of illustrating the comparative behavior between two or more genomes are essential for this approach, and can be classified into three categories based on the representation of the relationships between the compared genomes:
* Dot plots: This scheme can only show the alignment of two genomes. One genome is represented along the horizontal axis and the other along the vertical axis, and the dots in the plot represent the genomic elements that are similar between the two.
* Linear representation: This representation uses multiple linear tracks to represent multiple genomes and their features, where "track" is a concept that refers to a specific type of genomic feature at a genomic location.
* Circular representation: This representation facilitates comparison of whole microbial or viral genomes. In this visualization mode, concentric circles and arcs are used to represent genomic sections.
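The dot plot is the simplest of the three schemes to reproduce. The sketch below marks a dot wherever the two sequences share an identical k-mer; real tools work on whole-genome alignments rather than exact k-mer matches, and the sequences, `k` value and function names here are chosen only for illustration.

```python
def dot_plot(seq_x, seq_y, k=3):
    """Return (x, y) start positions where the k-mers of the two sequences are identical."""
    index = {}
    for y in range(len(seq_y) - k + 1):
        index.setdefault(seq_y[y:y + k], []).append(y)
    return [(x, y)
            for x in range(len(seq_x) - k + 1)
            for y in index.get(seq_x[x:x + k], [])]

def render(dots, width, height):
    """Draw the dots as text: columns follow seq_x, rows follow seq_y."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for x, y in dots:
        grid[y][x] = "*"
    return "\n".join("".join(row) for row in grid)

seq_x = "ACGTACGT"
seq_y = "ACGTTCGT"
dots = dot_plot(seq_x, seq_y, k=3)
print(render(dots, width=len(seq_x) - 2, height=len(seq_y) - 2))
```

Runs of dots along a diagonal indicate conserved segments, while off-diagonal dots indicate repeated or rearranged segments.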

Quality control

The quality of the sequence assembly influences the quality of the annotation, so it is important to assess assembly quality before performing the subsequent annotation steps. In order to quantify the quality of a genome annotation, three metrics have been used: recall, precision and accuracy. These measures, however, are not explicitly used in annotation projects, but rather in discussions of prediction accuracy. Community annotation approaches are useful techniques for quality control and standardization in genome annotation. An annotation jamboree that took place in 2002 led to the creation of the annotation standards used by the Sanger Institute's Human and Vertebrate Analysis Project (HAVANA).
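For reference, the three metrics can be computed at the nucleotide level by comparing the set of positions predicted as coding with a reference set. This sketch assumes the usual classification definitions (accuracy as the fraction of all positions labelled correctly); gene-prediction literature sometimes defines accuracy differently, and the coordinates below are invented.

```python
def annotation_metrics(predicted, reference, genome_length):
    """Nucleotide-level recall, precision and accuracy of a predicted annotation."""
    tp = len(predicted & reference)   # annotated in both
    fp = len(predicted - reference)   # predicted only
    fn = len(reference - predicted)   # missed by the prediction
    tn = genome_length - tp - fp - fn # annotated in neither
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / genome_length,
    }

# Hypothetical 1000 bp sequence with one reference gene and one shifted prediction.
reference = set(range(100, 200))
predicted = set(range(120, 220))
metrics = annotation_metrics(predicted, reference, genome_length=1000)
print(metrics)
```

The sketch does not guard against empty prediction or reference sets, which would cause a division by zero.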

Re-annotation

Annotation projects often rely on previous annotations of an organism's genome; however, these older annotations may contain errors that can propagate to new annotations. As new genome analysis technologies are developed and richer databases become available, the annotation of some older genomes may be updated. This process, known as re-annotation, can provide users with new information about the genome, including details about genes and protein functions. Re-annotation is therefore a useful approach in quality control.

Community annotation

Community annotation consists of engaging a community (both scientific and nonscientific) in genome annotation projects. It can be classified into the following six categories:
* Factory model: Annotation is performed by a completely automated pipeline.
* Museum model: Manual curation by experts is involved to interpret the results of an annotation project.
* Cottage industry model: Annotation is decentralized and is the result of the effort of different part-time curators.
* Party or jamboree model: Consists of a short intensive workshop with leading curators from the community. It was first used in the ''Drosophila melanogaster'' genome annotation project.
* Blessed annotator: A variation of the museum model, applied in the Knockout Mouse Project (KOMP), in which curators go through a training period prior to annotation, and are then given access to annotation tools to continue their work.
* Gatekeeper approach: A combination of the jamboree and cottage industry models. It begins with an annotation workshop, followed by a decentralized collaboration to extend and refine the initial annotation. It has been used for multiple species data.

A community annotation is said to be ''supervised'' when there is a coordinator who manages the project by requesting the annotation of specific items from a select number of experts. On the other hand, when anyone can enter a project and coordination is accomplished in a decentralized manner, it is called ''unsupervised'' community annotation. Supervised community annotation is short-lived and limited to the duration of the event, whereas the unsupervised counterpart does not have this limitation. However, the latter has been less successful than the former, presumably due to a lack of time, motivation, incentive and/or communication.

Wikipedia has multiple WikiProjects aimed at improving annotation. The Gene WikiProject, for instance, operates a bot that harvests gene data from research databases and creates gene stubs on that basis. The RNA WikiProject seeks to write articles that describe individual RNAs and RNA families in an accessible way.

Applications

Disease diagnosis

Gene Ontology is being used by researchers to establish disease-gene relationships, as GO helps in the identification of novel genes and of the alterations in their expression, distribution and function under different sets of conditions, such as diseased versus healthy. Databases of these disease-gene relationships in different organisms have been created, such as Plant-Pathogen Ontology, Plant-Associated Microbe Gene Ontology and DisGeNET. Others have been implemented in pre-existing databases, such as the Rat Disease Ontology in the Rat Genome Database.

Bioremediation

A great diversity of catabolic enzymes involved in hydrocarbon degradation by some bacterial strains are encoded by genes located in their mobile genetic elements (MGEs). The study of these elements is of great importance in the field of bioremediation, since the inoculation of wild or genetically modified strains with these MGEs has recently been pursued in order to acquire these hydrocarbon degradation capacities. In 2013, Phale et al. published the genome annotation of a strain of ''Pseudomonas putida'' (CSV86), a bacterium known for its preference for naphthalene and other aromatic compounds over glucose as a carbon and energy source. In order to find the MGEs of this bacterium, its genome was annotated using RAST and the NCBI Prokaryotic Genome Annotation Pipeline (PGAP), and nine mobile elements were identified with the Insertion Sequence (IS) Finder database. This analysis located the upper pathway genes of naphthalene degradation right next to the genes encoding tRNA-Gly and integrase, and identified the genes encoding enzymes involved in the degradation of salicylate, benzoate, 4-hydroxybenzoate, phenylacetic acid and hydroxyphenyl acetic acid, as well as an operon involved in glucose transport in the strain.

Gene Ontology analysis is of great importance in functional annotation. In bioremediation specifically, it can be applied to relate the genes of some microorganisms to their functions and to their role in the remediation of certain contaminants. This was the approach taken in the investigation and identification of ''Halomonas zincidurans'' strain B6(T), a bacterium with thirty-one genes encoding resistance to heavy metals, especially zinc, and of ''Stenotrophomonas'' sp. DDT-1, a strain capable of using DDT as its sole carbon and energy source, to mention a few examples.

Software

Genes in a eukaryotic genome can be annotated using various annotation tools such as FINDER. A modern annotation pipeline, such as MOSGA, can support a user-friendly web interface and software containerization. Modern annotation pipelines for prokaryotic genomes are Bakta, Prokka and PGAP. The National Center for Biomedical Ontology develops tools for automated annotation of database records based on the textual descriptions of those records. As a general method, dcGO has an automated procedure for statistically inferring associations between ontology terms and protein domains or combinations of domains from the existing gene/protein-level annotations. A variety of software tools have been developed that allow scientists to view and share genome annotations, such as MAKER.

Genome annotation is an active area of investigation and involves a number of different organizations in the life science community, which publish the results of their efforts in publicly available biological databases accessible via the web and other electronic means. Here is an alphabetical listing of ongoing projects relevant to genome annotation:
* Encyclopedia of DNA elements (ENCODE)
* Entrez Gene
* Ensembl
* FlyBase
* GENCODE
* Gene Ontology Consortium
* GeneRIF
* Mouse Genome Informatics
* RefSeq
* UniProt
* Vertebrate and Genome Annotation Project (Vega)
* WormBase
