GeneMark is a generic name for a family of ab initio gene prediction algorithms and software programs developed at the Georgia Institute of Technology in Atlanta. Developed in 1993, the original GeneMark was used in 1995 as a primary gene prediction tool for annotation of the first completely sequenced bacterial genome, that of ''Haemophilus influenzae'', and in 1996 for the first archaeal genome, that of ''Methanococcus jannaschii''. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in two DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type (protein-coding and non-coding). The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of being "protein-coding" (carrying genetic code) in each of six possible reading frames (including three frames in the complementary DNA strand) or of being "non-coding". The original GeneMark (developed before the advent of HMM applications in bioinformatics) was an HMM-like algorithm; it can be viewed as an approximation to the posterior decoding algorithm known in HMM theory, applied to an appropriately defined HMM of the DNA sequence.
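The Bayesian step described above can be illustrated with a minimal Python sketch. It is not the GeneMark implementation: it assumes a zero-order three-periodic model (one base distribution per codon position) rather than the higher-order Markov chains used in practice, and all probabilities, the prior, and the function names are hand-picked for illustration only.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Toy, hand-picked parameters (NOT trained values): probability of each base
# at codon positions 1, 2 and 3 of a protein-coding region.
CODING = [
    {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},  # codon position 1
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},  # codon position 2
    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},  # codon position 3
]
# Homogeneous (position-independent) toy model of non-coding sequence.
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def reverse_complement(seq):
    """Return the complementary strand read in its own 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, frame):
    """Log-likelihood of seq if its first base sits at codon position frame (0, 1 or 2)."""
    return sum(math.log(CODING[(frame + i) % 3][b]) for i, b in enumerate(seq))


def log_lik_noncoding(seq):
    """Log-likelihood of seq under the homogeneous non-coding model."""
    return sum(math.log(NONCODING[b]) for b in seq)


def posteriors(seq, prior_coding=0.5):
    """Posterior probabilities of the seven hypotheses for one DNA fragment.

    The coding prior is split evenly over the six reading frames (an assumption).
    """
    rc = reverse_complement(seq)
    log_joint = {"non-coding": math.log(1 - prior_coding) + log_lik_noncoding(seq)}
    frame_prior = math.log(prior_coding / 6)
    for f in range(3):
        log_joint[f"direct frame {f + 1}"] = frame_prior + log_lik_coding(seq, f)
        log_joint[f"complement frame {f + 1}"] = frame_prior + log_lik_coding(rc, f)
    # Bayes' rule, normalised in log space for numerical stability.
    top = max(log_joint.values())
    total = sum(math.exp(v - top) for v in log_joint.values())
    return {h: math.exp(v - top) / total for h, v in log_joint.items()}


if __name__ == "__main__":
    for hypothesis, p in posteriors("GAG" * 12).items():
        print(f"{hypothesis}: {p:.4f}")
```

In a scan of a whole genome, such a calculation would be repeated for a window sliding along the sequence, giving seven posterior probability profiles.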

Further improvements in the algorithms for gene prediction in prokaryotic genomes

The GeneMark.hmm algorithm (1998) was designed to improve the accuracy of prediction of short genes and gene starts. The idea was to use the inhomogeneous Markov chain models introduced in GeneMark for computing likelihoods of the sequences emitted by the states of a hidden Markov model, or rather a semi-Markov HMM, or generalized HMM (GHMM), describing the genomic sequence. The borders between coding and non-coding regions were formally interpreted as transitions between hidden states. Additionally, a ribosome binding site model was added to the GHMM to improve the accuracy of gene start prediction. The next important step in the algorithm development was the introduction of self-training, or unsupervised training, of the model parameters in the new gene prediction tool GeneMarkS (2001). Rapid accumulation of prokaryotic genomes in the following years showed that the structure of sequence patterns related to gene expression regulation signals near gene starts may vary. It was also observed that a prokaryotic genome may exhibit GC content variability due to lateral gene transfer. The new algorithm, GeneMarkS-2, was designed to make automatic adjustments to the types of gene expression patterns and to the GC content changes along the genomic sequence. GeneMarkS and, later, GeneMarkS-2 have been used in the NCBI pipeline for prokaryotic genome annotation (PGAP).
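The interpretation of gene borders as transitions between hidden states can be sketched with a plain four-state HMM decoded by the Viterbi algorithm. This is a deliberate simplification, not GeneMark.hmm itself: it has no state durations (so it is not a GHMM), no start or stop codons, no ribosome binding site model, and no reverse strand. The state names and all probabilities below are assumptions chosen for illustration.

```python
import math

STATES = ["N", "C1", "C2", "C3"]  # non-coding, and codon positions 1-3

# Toy emission probabilities (illustrative values, not trained parameters).
EMIT = {
    "N":  {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "C1": {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},
    "C2": {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    "C3": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}
# A coding region can only be entered at codon position 1 and left after
# codon position 3; these two transitions are the "gene borders".
TRANS = {
    "N":  {"N": 0.95, "C1": 0.05},
    "C1": {"C2": 1.0},
    "C2": {"C3": 1.0},
    "C3": {"C1": 0.95, "N": 0.05},
}
START = {"N": 0.9, "C1": 0.1}


def _log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable hidden-state path for a non-empty seq."""
    score = {s: _log(START.get(s, 0)) + _log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + _log(TRANS[p].get(s, 0)))
            new_score[s] = score[prev] + _log(TRANS[prev].get(s, 0)) + _log(EMIT[s][base])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def coding_segments(path):
    """Collapse a state path into (start, end) index pairs of coding runs, end exclusive."""
    segments, start = [], None
    for i, s in enumerate(path):
        if s != "N" and start is None:
            start = i
        elif s == "N" and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(path)))
    return segments


if __name__ == "__main__":
    seq = "AT" * 10 + "GAG" * 15 + "TA" * 10
    print(coding_segments(viterbi(seq)))
```

A generalized HMM would additionally attach an explicit length distribution to each state, so that a whole gene or intergenic region is emitted at once rather than base by base.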

Heuristic Models and Gene Prediction in Metagenomes and Metatranscriptomes

Accurate identification of species-specific parameters of a gene finding algorithm is a necessary condition for making accurate gene predictions. However, in studies of viral genomes one needs to estimate parameters from a rather short sequence that has no large genomic context. Importantly, starting in 2004, the same question had to be addressed for gene prediction in short metagenomic sequences. A surprisingly accurate answer was found by introducing parameter-generating functions that depend on a single variable, the sequence G+C content (the "heuristic method", 1999). Subsequently, analysis of several hundred prokaryotic genomes led to the development of a more advanced heuristic method in 2010 (implemented in MetaGeneMark). Later, the need to predict genes in RNA transcripts led to the development of GeneMarkS-T (2015), a tool that identifies intron-less genes in long transcript sequences assembled from RNA-Seq reads.
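The idea of a parameter-generating function of a single variable can be shown with a very small Python sketch. It covers only the simplest case, a zero-order non-coding model, under the assumption that P(G) = P(C) and P(A) = P(T). The fitted functions that the heuristic method uses for protein-coding parameters are not reproduced here, and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def noncoding_model_from_gc(gc):
    """Zero-order base frequencies generated from G+C content alone.

    Assumes P(G) = P(C) and P(A) = P(T); this is an illustrative
    simplification, not the published heuristic model.
    """
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}


if __name__ == "__main__":
    fragment = "GCGCGCAT"  # a short sequence with no genomic context
    print(noncoding_model_from_gc(gc_content(fragment)))
```

The point of the design is that a short fragment is too small to train a full model, but long enough to measure one number, its G+C content, from which a complete parameter set can be looked up or computed.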

Eukaryotic gene prediction

In eukaryotic genomes, modeling of exon borders with introns and intergenic regions presents a major challenge. The GHMM architecture of eukaryotic GeneMark.hmm includes hidden states for initial, internal, and terminal exons, introns, intergenic regions, and single-exon genes located in both DNA strands. The initial version of the eukaryotic GeneMark.hmm needed manual compilation of training sets of protein-coding sequences for estimation of the algorithm parameters. However, in 2005, the first self-training eukaryotic gene finder, GeneMark-ES, was developed. A fungal version of GeneMark-ES, developed in 2008, features a more complex intron model and a hierarchical strategy of self-training. In 2014, GeneMark-ET was introduced, in which the self-training of parameters is aided by extrinsic hints generated by mapping short RNA-Seq reads to the genome. Extrinsic evidence is not limited to 'native' RNA sequences. Cross-species proteins collected in the vast protein databases can be a source of external hints if homologous relationships are established between already known proteins and the proteins encoded by yet unknown genes in the novel genome. This task was solved by the development of a new algorithm, GeneMark-EP+ (2020). Integration of the RNA and protein sources of extrinsic hints was done in GeneMark-ETP (2023). The versatility and accuracy of the eukaryotic gene finders of the GeneMark family have led to their incorporation into a number of genome annotation pipelines. Also, since 2016, the pipelines BRAKER1, BRAKER2, and BRAKER3 have been developed to combine the strongest features of GeneMark and AUGUSTUS. Notably, gene prediction in eukaryotic transcripts can be done by the algorithm GeneMarkS-T (2015).

GeneMark Family of Gene Prediction Programs

Bacteria, Archaea

* GeneMark
* GeneMarkS
* GeneMarkS-2

Metagenomes and Metatranscriptomes

* MetaGeneMark
* GeneMarkS-T

Eukaryotes

* GeneMark
* GeneMark.hmm
* GeneMark-ES: ab initio gene finding algorithm for eukaryotic genomes with automatic (unsupervised) training.
* GeneMark-ET: augments GeneMark-ES by integrating RNA-Seq read alignments into the self-training procedure.
* GeneMark-EP+: augments GeneMark-ES by iteratively finding genes in a novel genome, detecting similarities of predicted genes to known proteins, splice-aligning the known proteins to the genome, generating hints for the next round of prediction, and correcting predictions based on the external evidence.
* GeneMark-ETP: integrates genomic, transcript, and protein evidence into gene prediction.

Viruses, phages and plasmids

* Heuristic models

Transcripts assembled from RNA-Seq reads

* GeneMarkS-T

References

* Borodovsky M. and McIninch J.
"GeneMark: parallel gene recognition for both DNA strands."
''Computers & Chemistry'' (1993) 17 (2): 123–133
* Lukashin A. and Borodovsky M.
"GeneMark.hmm: new solutions for gene finding."
''Nucleic Acids Research'' (1998) 26 (4): 1107–1115
* Besemer J. and Borodovsky M.
"Heuristic approach to deriving models for gene finding."
''Nucleic Acids Research'' (1999) 27 (19): 3911–3920
* Besemer J., Lomsadze A., and Borodovsky M.
"GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions."
''Nucleic Acids Research'' (2001) 29 (12): 2607–2618
* Mills R., Rozanov M., Lomsadze A., Tatusova T., and Borodovsky M.
"Improving gene annotation in complete viral genomes."
''Nucleic Acids Research'' (2003) 31 (23): 7041–7055
* Besemer J. and Borodovsky M.
"GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses."
''Nucleic Acids Research'' (2005) 33 (Web Server Issue): W451–454
* Lomsadze A., Ter-Hovhannisyan V., Chernoff Y., and Borodovsky M.
"Gene identification in novel eukaryotic genomes by self-training algorithm."
''Nucleic Acids Research'' (2005) 33 (20): 6494–6506
* Ter-Hovhannisyan V., Lomsadze A., Chernoff Y., and Borodovsky M.
"Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training."
''Genome Research'' (2008) 18 (12): 1979–1990
* Zhu W., Lomsadze A., and Borodovsky M.
"Ab initio gene identification in metagenomic sequences."
''Nucleic Acids Research'' (2010) 38 (12): e132
* Lomsadze A., Burns P.D., and Borodovsky M.
"Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm."
''Nucleic Acids Research'' (2014) 42 (15): e119
* Tang S., Lomsadze A., and Borodovsky M.
"Identification of protein coding regions in RNA transcripts."
''Nucleic Acids Research'' (2015) 43 (12): e78
* Tatusova T., DiCuccio M., Badretdin A., Chetvernin V., Nawrocki E., Zaslavsky L., Lomsadze A., Pruitt K., Borodovsky M., and Ostell J.
"NCBI prokaryotic genome annotation pipeline."
''Nucleic Acids Research'' (2016) 44 (14): 6614–6624
* Hoff K., Lange S., Lomsadze A., Borodovsky M., and Stanke M.
"BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS."
''Bioinformatics'' (2016) 32 (5): 767–769
* Lomsadze A., Gemayel K., Tang S., and Borodovsky M.
"Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes."
''Genome Research'' (2018) 28 (7): 1079–1089
* Bruna T., Hoff K., Lomsadze A., Stanke M., and Borodovsky M.
"BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database."
''NAR Genomics and Bioinformatics'' (2021) 3 (1): lqaa10
* Bruna T., Lomsadze A., and Borodovsky M.
"GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins."
''NAR Genomics and Bioinformatics'' (2020) 2 (2): lqaa02
* Bruna T., Lomsadze A., and Borodovsky M.
"GeneMark-ETP: Automatic Gene Finding in Eukaryotic Genomes in Consistence with Extrinsic Data."
''bioRxiv'' (Jan 5, 2023)
* Gabriel L., Brůna T., Hoff K., Ebel M., Lomsadze A., Borodovsky M., and Stanke M.
"BRAKER3: Fully automated genome annotation using RNA-Seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA."
''bioRxiv'' (Nov 27, 2023)
