Genome mining describes the exploitation of genomic information for the discovery of biosynthetic pathways of natural products and their possible interactions. It depends on computational technology and bioinformatics tools. The mining process relies on a huge amount of data (represented by DNA sequences and annotations) accessible in genomic databases. By applying data mining algorithms, the data can be used to generate new knowledge in several areas of medicinal chemistry, such as discovering novel natural products.

History

From the mid- to late 1980s, researchers increasingly focused on genetic studies as sequencing technologies advanced. The GenBank database was established in 1982 for the collection, management, storage, and distribution of DNA sequence data, in response to the increasing availability of DNA sequences. As the volume of genetic data grew, biotechnology companies have been able, since 1992, to use human DNA sequences to develop protein and antibody drugs through genome mining. In the late 1990s, many companies, such as Amgen, Immunex, and Genentech, developed drugs that progressed to the clinical stage by adopting genome mining. Since the Human Genome Project was completed in the early 2000s, researchers have been sequencing the genomes of many microorganisms. Subsequently, many of these genomes have been carefully studied to identify new genes and biosynthetic pathways.

Algorithms

As large quantities of genomic sequence data began to accumulate in public databases, genetic algorithms became important for deciphering the enormous collection of genomic data. They are commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover and selection. The following tools and algorithms are commonly used in genome mining:
* AntiSMASH (Antibiotics and Secondary Metabolite Analysis Shell) is a pipeline for identifying and analysing secondary metabolite biosynthetic gene clusters in genomes.
* BiG-SCAPE performs large-scale network analysis and classification of biosynthetic gene clusters.
* PRISM (Prediction Informatics for Secondary Metabolites) is a combinatorial approach to chemical structure prediction for genetically encoded nonribosomal peptides and type I and II polyketides.
* Statistically based sequence similarity (SIM) methods, such as FASTA and PSI-BLAST, are used to infer orthologous homology.
* BLAST (Basic Local Alignment Search Tool) is an approach for rapid sequence comparison.
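The sequence comparison behind several of the tools listed above can be illustrated with a small local alignment scorer. The sketch below uses the Smith-Waterman dynamic programme, which computes the exact optimum that heuristic tools such as BLAST approximate much faster; it is not BLAST's own algorithm. The function name and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions chosen for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring values only; real tools use substitution
    matrices and affine gap penalties.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Two made-up sequences sharing the motif "ACGT"
    print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

Only two rows of the matrix are kept in memory because the sketch reports the score alone; recovering the aligned region itself would require storing the full matrix and tracing back from the best cell.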

Applications

Genome mining is applied to the discovery of natural products, where it facilitates the characterization of novel molecules and biosynthetic pathways.

Natural product discovery

The production of natural products is regulated by the biosynthetic gene clusters (BGCs) encoded in the microorganism. By adopting genome mining, the BGCs that produce the target natural product can be predicted. Important enzymes and product classes involved in the formation of natural products include polyketide synthases (PKS), non-ribosomal peptide synthetases (NRPS), ribosomally synthesized and post-translationally modified peptides (RiPPs), and terpenoids, among many others. By mining for these enzymes, researchers can determine the classes of compounds that BGCs encode and compare target gene clusters to known gene clusters. To verify the relation between the BGCs and natural products, the target BGCs can be expressed in a suitable host through the use of molecular cloning.
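Comparing a target gene cluster to known gene clusters can be sketched as a set comparison. The example below assumes each cluster has already been reduced to the set of enzyme domains it encodes and ranks known clusters by the Jaccard index of those sets. This is a deliberately simplified illustration, not the scoring method of any particular tool; the function names and the cluster names in the example data are hypothetical.

```python
def jaccard_index(a: set, b: set) -> float:
    """Shared elements divided by total distinct elements (0.0 if both sets are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_known_clusters(query: set, known: dict) -> list:
    """Return (cluster name, similarity) pairs, most similar first."""
    scores = [(name, jaccard_index(query, domains))
              for name, domains in known.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical clusters described by common PKS and NRPS domain abbreviations
    query = {"KS", "AT", "ACP", "KR"}
    known = {
        "known_pks": {"KS", "AT", "ACP", "KR", "TE"},
        "known_nrps": {"C", "A", "PCP", "TE"},
        "known_hybrid": {"KS", "AT", "ACP", "C", "A", "PCP"},
    }
    for name, score in rank_known_clusters(query, known):
        print(f"{name}: {score:.2f}")
```

A set-based score ignores domain order and copy number, so it is best read as a coarse first filter before a sequence-level comparison.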

Databases and tools

Genetic data have accumulated in databases. Researchers can use algorithms to decipher the data accessible from these databases for the discovery of new processes, targets, and products. The following databases and tools are available:
* GenBank provides genomic datasets for analysis.
* UCSC Genome Browser offers interactive access to genome sequence data from a variety of species.
* AntiSMASH-DB allows comparing the sequences of newly sequenced BGCs against those of previously predicted and experimentally characterized ones.
* BiG-FAM is a biosynthetic gene cluster family database.
* DoBISCUIT is a database of secondary metabolite biosynthetic gene clusters.
* MIBiG (Minimum Information about a Biosynthetic Gene cluster specification) provides a standard for annotations and metadata on biosynthetic gene clusters and their molecular products.
* Interactive Tree of Life (iTOL) is a web-based tool for the display, manipulation and annotation of phylogenetic trees.
