
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
In its earliest days, "gene finding" was based on painstaking experimentation on living cells and organisms. Statistical analysis of the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other. Today, with comprehensive genome sequence and powerful computational resources at the disposal of the research community, gene finding has been redefined as a largely computational problem.
Determining that a sequence is functional should be distinguished from determining the function of the gene or its product. Predicting the function of a gene and confirming that the gene prediction is accurate still demands ''in vivo'' experimentation through gene knockout and other assays, although frontiers of bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.
Gene prediction is one of the key steps in
genome annotation, following
sequence assembly, the filtering of non-coding regions and repeat masking.
Gene prediction is closely related to the so-called 'target search problem' investigating how DNA-binding proteins (transcription factors) locate specific binding sites within the genome.
Many aspects of structural gene prediction are based on current understanding of underlying biochemical processes in the cell such as gene transcription, translation, protein–protein interactions and regulation processes, which are the subject of active research in the various omics fields such as transcriptomics, proteomics, metabolomics, and more generally structural and functional genomics.
Empirical methods
In empirical (similarity, homology or evidence-based) gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of known expressed sequence tags, messenger RNA (mRNA), protein products, and homologous or orthologous sequences. Given an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from which it had to have been transcribed. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. Given a sequence, local alignment algorithms such as BLAST, FASTA and Smith-Waterman look for regions of similarity between the target sequence and possible candidate matches. The success of this approach is limited by the contents and accuracy of the sequence database.
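The local alignment idea mentioned above can be illustrated with a minimal sketch of Smith-Waterman scoring. The scoring values (`match`, `mismatch`, `gap`) and the linear gap penalty are assumptions chosen for illustration; production tools such as BLAST use heuristics and substitution matrices that are not reproduced here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming table, since only the score is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally (match or mismatch) ...
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # ... or open a gap in one of the two sequences.
            up = prev[j] + gap
            left = curr[j - 1] + gap
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best
```

With these illustrative parameters, two sequences that share the exact substring `ACGT` and nothing else score 8 (four matches), while sequences with no shared bases score 0.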
A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene. However, to apply this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms, only a subset of all genes in the organism's genome are expressed at any given time, meaning that extrinsic evidence for many genes is not readily accessible in any single cell culture. Thus, collecting extrinsic evidence for most or all of the genes in a complex organism requires the study of many hundreds or thousands of cell types, which presents further difficulties. For example, some human genes may be expressed only during development as an embryo or fetus, which might be difficult to study for ethical reasons.
Despite these difficulties, extensive transcript and protein sequence databases have been generated for human as well as other important model organisms in biology, such as mice and yeast. For example, the RefSeq database contains transcript and protein sequences from many different species, and the Ensembl system comprehensively maps this evidence to human and several other genomes. It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data.
New high-throughput transcriptome sequencing technologies such as RNA-Seq and ChIP-sequencing open opportunities for incorporating additional extrinsic evidence into gene prediction and validation, and offer a structurally rich and more accurate alternative to previous methods of measuring gene expression such as expressed sequence tags or DNA microarrays.
Major challenges in gene prediction include dealing with sequencing errors in raw DNA data, dependence on the quality of the sequence assembly, handling short reads, frameshift mutations, overlapping genes and incomplete genes.
In prokaryotes it is essential to consider horizontal gene transfer when searching for gene sequence homology. An additional important factor underused in current gene detection tools is the existence of gene clusters: operons (which are functioning units of DNA containing a cluster of genes under the control of a single promoter) in both prokaryotes and eukaryotes. Most popular gene detectors treat each gene in isolation, independent of others, which is not biologically accurate.
''Ab initio'' methods
''Ab initio'' gene prediction is an intrinsic method based on gene content and signal detection. Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to ''ab initio'' gene finding, in which the genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either ''signals'', specific sequences that indicate the presence of a gene nearby, or ''content'', statistical properties of the protein-coding sequence itself. ''Ab initio'' gene finding might be more accurately characterized as gene ''prediction'', since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.

In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20–25 codons, or 60–75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities and other statistical properties that are easy to detect in a sequence of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.
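The open reading frame argument above can be made concrete with a minimal sketch. The function names, the use of `ATG` as the only start codon, and the default `min_codons` threshold are assumptions for illustration; real prokaryotic gene finders also consider alternative start codons and content statistics.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored

# With 3 stop codons out of 64 and uniformly random codons, a stop is
# expected about once every 64 / 3 (roughly 21) codons.
EXPECTED_STOP_SPACING = 64 / 3


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=100):
    """Yield (strand, frame, start, end) for each ORF of at least min_codons.

    Coordinates are 0-based, end-exclusive, and refer to the strand that
    was scanned (the reverse complement for strand "-"). The stop codon
    is included in the reported span.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield (strand, frame, start, i + 3)
                    start = None  # look for the next ORF in this frame
```

Because a random sequence is expected to contain a stop codon roughly every 21 codons, an uninterrupted frame of 100 or more codons is unlikely to arise by chance, which is why the length threshold alone is already informative.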
''Ab initio'' gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well-understood than in prokaryotes, making them more difficult to reliably recognize. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and binding sites for a poly(A) tail.
Second, splicing mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.
Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as hidden Markov models (HMMs), to combine information from a variety of different signal and content measurements. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach. Eukaryotic ''ab initio'' gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid programs. The GeneMark-ES and SNAP gene finders are, like GENSCAN, based on generalized hidden Markov models (GHMMs). They attempt to address problems related to using a gene finder on a genome sequence that it was not trained against. A few recent approaches like mSplicer, CONTRAST, or mGene also use machine learning techniques like support vector machines for successful gene prediction. They build a discriminative model using hidden Markov support vector machines or conditional random fields to learn an accurate gene prediction scoring function.
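As a rough illustration of how a hidden Markov model labels a sequence, the following sketch runs the Viterbi algorithm on a two-state toy model. The states, the start, transition and emission probabilities, and the assumption that coding DNA is GC-rich are all invented for this example; they are not the parameters of GLIMMER, GeneMark, GENSCAN or any other real gene finder, which use far richer state structures.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical toy parameters, chosen only to make the example readable.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # Log-probabilities avoid numerical underflow on long sequences.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

In this toy model a state change is penalised by the low transition probability, so the path only switches when enough consecutive bases favour the other state. This is the same trade-off between signal and content evidence that the larger models resolve.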
''Ab initio'' methods have been benchmarked, with some approaching 100% sensitivity; however, as the sensitivity increases, accuracy suffers as a result of increased false positives.
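The trade-off between sensitivity and false positives can be shown with a small calculation. The function name and the representation of genes as hashable identifiers are assumptions for illustration; published benchmarks typically compare predictions at the nucleotide, exon and gene level with more careful matching rules.

```python
def sensitivity_and_precision(predicted, actual):
    """Return (sensitivity, precision) for two collections of gene identifiers."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)   # real genes that were predicted
    false_pos = len(predicted - actual)  # predictions that are not real genes
    false_neg = len(actual - predicted)  # real genes that were missed
    sensitivity = true_pos / (true_pos + false_neg) if actual else 0.0
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    return sensitivity, precision
```

For example, a predictor that reports all four real genes along with four spurious ones reaches a sensitivity of 1.0 but a precision of only 0.5.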
Other signals
Among the derived signals used for prediction are statistics resulting from sub-sequence statistics such as k-mer statistics, isochore or compositional domain GC composition/uniformity/entropy, sequence and frame length, intron/exon/donor/acceptor/promoter and ribosomal binding site vocabulary, fractal dimension, Fourier transform of a pseudo-number-coded DNA, Z-curve parameters and certain run features.
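Two of the simpler derived signals listed above, k-mer statistics and GC composition, can be computed in a few lines. This is a minimal sketch with illustrative function names; it does not cover the other listed features such as Z-curve parameters or Fourier-based measures.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gc_content(seq):
    """Return the fraction of bases that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)
```

Such counts are typically computed separately for known coding and non-coding regions, so that differences between the two distributions can be used as content evidence.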
It has been suggested that signals other than those directly detectable in sequences may improve gene prediction. For example, the role of secondary structure in the identification of regulatory motifs has been reported.
In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction.
Neural networks
Artificial neural networks are computational models that excel at machine learning and pattern recognition. Neural networks must be trained with example data before being able to generalise for experimental data, and tested against benchmark data. Neural networks are able to come up with approximate solutions to problems that are hard to solve algorithmically, provided there is sufficient training data. When applied to gene prediction, neural networks can be used alongside other ''ab initio'' methods to predict or identify biological features such as splice sites.
One approach
involves using a sliding window, which traverses the sequence data in an overlapping manner. The output at each position is a score based on whether the network thinks the window contains a donor splice site or an acceptor splice site. Larger windows offer more accuracy but also require more computational power. A neural network is an example of a signal sensor as its goal is to identify a functional site in the genome.
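The sliding-window scan described above can be sketched as follows. The scoring function `toy_donor_score` is a hypothetical stand-in for a trained network: it only checks for the `GT` dinucleotide at the centre of the window, whereas a real signal sensor would output a learned score. The window width of 6 is likewise an arbitrary choice for illustration.

```python
def sliding_windows(seq, width, step=1):
    """Yield (start, window) pairs for overlapping windows over the sequence."""
    for start in range(0, len(seq) - width + 1, step):
        yield start, seq[start:start + width]


def toy_donor_score(window):
    """Hypothetical scorer standing in for a trained neural network.

    Returns 1.0 when the two central bases are "GT" and 0.0 otherwise.
    """
    mid = len(window) // 2
    return 1.0 if window[mid:mid + 2] == "GT" else 0.0


def scan(seq, width=6, scorer=toy_donor_score):
    """Score every window and return a list of (start, score) pairs."""
    return [(start, scorer(window)) for start, window in sliding_windows(seq, width)]
```

Passing the scorer as a parameter keeps the window traversal independent of the model, so a trained classifier could be substituted without changing the scanning code.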
Combined approaches
Programs such as Maker combine extrinsic and ''ab initio'' approaches by mapping protein and
EST data to the genome to validate ''ab initio'' predictions. Augustus, which may be used as part of the Maker pipeline, can also incorporate hints in the form of EST alignments or protein profiles to increase the accuracy of the gene prediction.
Comparative genomics approaches
As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is a comparative genomics approach.
This is based on the principle that the forces of natural selection cause genes and other functional elements to undergo mutation at a slower rate than the rest of the genome, since mutations in functional elements are more likely to negatively impact the organism than mutations elsewhere. Genes can thus be detected by comparing the genomes of related species to detect this evolutionary pressure for conservation. This approach was first applied to the mouse and human genomes, using programs such as SLAM, SGP and TWINSCAN/N-SCAN and CONTRAST.
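The conservation principle can be illustrated with a crude per-window identity measure over a pairwise alignment. This is only a sketch under strong assumptions: the two sequences are already aligned, gaps are written as `-`, and percent identity is used as a proxy for selective constraint. Programs such as TWINSCAN and N-SCAN use probabilistic models of conservation rather than this simple count.

```python
def window_identity(aligned_a, aligned_b, width):
    """Return the fraction of identical, non-gap columns in each window.

    Windows are non-overlapping; a trailing partial window is ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both sequences carry the same base, 0 for mismatches and gaps.
    matches = [int(x == y and x != "-") for x, y in zip(aligned_a, aligned_b)]
    return [
        sum(matches[i:i + width]) / width
        for i in range(0, len(matches) - width + 1, width)
    ]
```

Windows with identity well above the background level for the species pair are candidates for functional elements, although conserved non-coding regions will also score highly.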
Multiple informants
TWINSCAN examined only human-mouse synteny to look for orthologous genes. Programs such as N-SCAN and CONTRAST allowed the incorporation of alignments from multiple organisms, or in the case of N-SCAN, a single alternate organism from the target. The use of multiple informants can lead to significant improvements in accuracy.
CONTRAST is composed of two elements. The first is a smaller classifier, identifying donor splice sites and acceptor splice sites as well as start and stop codons. The second element involves constructing a full model using machine learning. Breaking the problem into two means that smaller targeted data sets can be used to train the classifiers, and that each classifier can operate independently and be trained with smaller windows. The full model can use the independent classifiers, and does not have to waste computational time or model complexity re-classifying intron-exon boundaries. The paper in which CONTRAST is introduced proposes that its method (and those of TWINSCAN, etc.) be classified as ''de novo'' gene assembly, which uses alternate 'informant' genomes alongside the target, and identifies it as distinct from ''ab initio'', which uses the target genome alone.
Comparative gene finding can also be used to project high quality annotations from one genome to another. Notable examples include Projector, GeneWise, GeneMapper and GeMoMa. Such techniques now play a central role in the annotation of all genomes.
Pseudogene prediction
Pseudogenes are close relatives of genes, sharing very high sequence homology, but being unable to code for the same protein product. Once regarded as byproducts of gene sequencing, they are increasingly becoming predictive targets in their own right as their regulatory roles are uncovered.
Pseudogene prediction utilises existing sequence similarity and ab initio methods, whilst adding additional filtering and methods of identifying pseudogene characteristics.
Sequence similarity methods can be customised for pseudogene prediction using additional filtering to find candidate pseudogenes. This could use disablement detection, which looks for nonsense or frameshift mutations that would truncate or collapse an otherwise functional coding sequence.
Additionally, translating DNA into protein sequences can be more effective than just straight DNA homology.
Content sensors can be filtered according to the differences in statistical properties between pseudogenes and genes, such as a reduced count of CpG islands in pseudogenes, or the differences in G-C content between pseudogenes and their neighbours. Signal sensors also can be honed to pseudogenes, looking for the absence of introns or polyadenine tails.
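Disablement detection, as described above, can be sketched as a scan for premature stop codons in a candidate coding sequence. The function name and the assumption that the candidate is already in frame with its parent gene are illustrative; a real pipeline would locate frameshifts from an alignment to the parent, which the length check below only approximates.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def disablements(candidate_cds):
    """Return a list of (kind, position) disablements in a candidate CDS.

    Assumes the sequence starts in frame. The final complete codon is
    allowed to be a stop codon and is therefore not reported.
    """
    seq = candidate_cds.upper()
    found = []
    if len(seq) % 3 != 0:
        # Crude proxy for a frameshift: the length is not a whole number of codons.
        found.append(("length_not_multiple_of_3", len(seq)))
    last_codon_start = len(seq) - len(seq) % 3 - 3
    for i in range(0, last_codon_start, 3):
        if seq[i:i + 3] in STOP_CODONS:
            found.append(("premature_stop", i))
    return found
```

A candidate with high similarity to a known gene but a non-empty result here would be flagged for closer inspection as a possible pseudogene.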
Metagenomic gene prediction
Metagenomics is the study of genetic material recovered from the environment, resulting in sequence information from a pool of organisms. Predicting genes is useful for comparative metagenomics.
Metagenomics tools also fall into the basic categories of using either sequence similarity approaches (MEGAN4) or ab initio techniques (GLIMMER-MG).
Glimmer-MG is an extension to GLIMMER that relies mostly on an ab initio approach for gene finding, using training sets from related organisms. The prediction strategy is augmented by classifying and clustering gene data sets prior to applying ab initio gene prediction methods. The data is clustered by species. This classification method leverages techniques from metagenomic phylogenetic classification. Examples of software for this purpose are Phymm, which uses interpolated Markov models, and PhymmBL, which integrates BLAST into the classification routines.
MEGAN4
uses a sequence similarity approach, using local alignment against databases of known sequences, but also attempts to classify using additional information on functional roles, biological pathways and enzymes. As in single organism gene prediction, sequence similarity approaches are limited by the size of the database.
FragGeneScan and MetaGeneAnnotator are popular gene prediction programs based on hidden Markov models. These predictors account for sequencing errors and partial genes, and work for short reads.
Another fast and accurate tool for gene prediction in metagenomes is MetaGeneMark.
This tool is used by the DOE Joint Genome Institute to annotate IMG/M, the largest metagenome collection to date.
See also
* List of gene prediction software
* Phylogenetic footprinting
* Protein function prediction
* Protein structure prediction
* Protein–protein interaction prediction
* Pseudogene (database)
* Sequence mining
* Sequence similarity (homology)
External links
* Augustus
* FGENESH
* GeMoMa - Homology-based gene prediction based on amino acid and intron position conservation as well as RNA-Seq data
* geneid
* SGP2
* Glimmer
* GlimmerHMM
* GenomeThreader
* GeneMark
* Gismo
* mGene
* StarORF - A multi-platform and web tool for predicting ORFs and obtaining reverse complement sequence
- A portable and easily configurable genome annotation pipeline