In metagenomics, binning is the computational process of grouping assembled contigs and assigning them to their separate genomes of origin. Binning methods can be based on compositional sequence features (such as GC-content or tetranucleotide frequencies), on sequence read mapping coverage across samples, or on both.
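As a minimal sketch of the compositional and coverage features mentioned above, the following computes GC-content for a contig and combines it with per-sample coverage into one feature tuple. The function names and the idea of concatenating both features into a single tuple are illustrative assumptions, not the interface of any particular binning tool.

```python
def gc_content(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous symbols such as N are ignored
    return gc / total if total else 0.0


def contig_features(seq, coverages):
    """Hypothetical feature tuple: GC-content followed by the mean read
    coverage of the contig in each sample."""
    return (gc_content(seq), *coverages)


if __name__ == "__main__":
    # Two toy contigs with made-up coverage values in three samples.
    print(contig_features("ATGCGCGGCC", [12.0, 3.5, 0.0]))
    print(contig_features("ATATTTAAGC", [1.0, 40.2, 8.7]))
```

Contigs from the same genome are expected to have similar values for such features, which is what clustering-based binners exploit.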

Introduction

Metagenomic samples typically consist of sequencing data from many unrelated organisms, because they are environmental in origin and contain the DNA of the whole community of microorganisms present in an environmental sample. For example, a single gram of soil can contain up to 18,000 different types of organisms, each with its own genome. Metagenomic assemblies are typically fragmented into many contigs, especially short-read assemblies, in which repeats and integrative elements can be difficult to resolve. Binning therefore takes place after metagenomic assembly and represents the effort to associate fragmented contigs with their genome of origin, termed a Metagenome Assembled Genome (MAG). The taxonomy of MAGs can then be inferred through placement into a reference phylogenetic tree using tools such as GTDB-Tk.

The first studies that sampled DNA from multiple organisms used specific genes to assess the diversity and origin of each sample. These marker genes had previously been sequenced from clonal cultures of known organisms, so whenever one of these genes appeared in a read or contig from the metagenomic sample, that read could be assigned to a known species or to the operational taxonomic unit (OTU) of that species. The problem with this method was that only a tiny fraction of the sequences carried a marker gene, leaving most of the data unassigned.

Modern binning techniques use both previously available information independent of the sample and intrinsic information present in the sample. Their degree of success varies with the diversity and complexity of the sample: in some cases they can resolve the sequences down to individual species, while in others the sequences can at best be assigned to very broad taxonomic groups. Binning of metagenomic data from various habitats might significantly extend the tree of life. One such approach, applied to globally available metagenomes, binned 52,515 individual microbial genomes and extended the known diversity of bacteria and archaea by 44%.

Algorithms

Binning algorithms can employ previous information, and thus act as supervised classifiers, or they can try to find new groups, and thus act as unsupervised classifiers. Many, of course, do both. The classifiers exploit previously known sequences by performing alignments against databases, and try to separate sequences based on organism-specific characteristics of the DNA, such as GC-content. Some prominent binning algorithms for metagenomic datasets obtained through shotgun sequencing include TETRA, MEGAN, Phylopythia, SOrt-ITEMS, and DiScRIBinATE, among others.

TETRA

TETRA is a statistical classifier that uses tetranucleotide usage patterns in genomic fragments. There are four possible nucleotides in DNA, so there can be 4^4 = 256 different fragments of four consecutive nucleotides; these fragments are called tetramers. TETRA works by tabulating the frequency of each tetramer for a given sequence. From these frequencies, z-scores are calculated, which indicate how over- or under-represented each tetramer is relative to what would be expected from the individual nucleotide composition. The z-scores for each tetramer are assembled into a vector, and the vectors corresponding to different sequences are compared pairwise to yield a measure of how similar different sequences from the sample are. The most similar sequences are expected to belong to organisms in the same OTU.
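The procedure described above can be sketched roughly as follows. This is a simplified illustration, not the TETRA implementation: it assumes a zero-order model in which the expected tetramer count is the product of the individual base frequencies, approximates the variance binomially, and uses the Pearson correlation as the pairwise similarity. The statistics used by TETRA itself may differ, and all function names are hypothetical.

```python
from itertools import product
from math import sqrt

BASES = "ACGT"
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]  # 4^4 = 256


def tetramer_zscores(seq):
    """Return a 256-element z-score vector for one sequence (assumed model:
    independent bases, binomial variance)."""
    seq = seq.upper()
    n_windows = len(seq) - 3
    total_bases = sum(seq.count(b) for b in BASES)
    if n_windows <= 0 or total_bases == 0:
        raise ValueError("sequence too short or contains no A/C/G/T")

    # Observed tetramer counts over all overlapping windows.
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(n_windows):
        kmer = seq[i:i + 4]
        if kmer in counts:  # skips windows with ambiguous symbols
            counts[kmer] += 1

    base_freq = {b: seq.count(b) / total_bases for b in BASES}
    zscores = []
    for tetramer in TETRAMERS:
        p = 1.0
        for base in tetramer:
            p *= base_freq[base]
        expected = n_windows * p
        variance = n_windows * p * (1.0 - p)
        z = (counts[tetramer] - expected) / sqrt(variance) if variance > 0 else 0.0
        zscores.append(z)
    return zscores


def pearson(x, y):
    """Pearson correlation between two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y) if var_x > 0 and var_y > 0 else 0.0


if __name__ == "__main__":
    z1 = tetramer_zscores("ACGTTGCAAGGCTTAACCGG" * 10)
    z2 = tetramer_zscores("ACGTTGCAAGGCTTAACCGT" * 10)
    print(round(pearson(z1, z2), 3))
```

A high correlation between two z-score vectors suggests a similar tetranucleotide usage pattern; the threshold at which two fragments are placed in the same bin is a separate design decision not covered here.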

MEGAN

In the DIAMOND+MEGAN approach, all reads are first aligned against a protein reference database, such as NCBI-nr, and the resulting alignments are then analyzed using the naive LCA (lowest common ancestor) algorithm, which places a read on the lowest taxonomic node in the NCBI taxonomy that lies above all taxa to which the read has a significant alignment. Here, an alignment is usually deemed "significant" if its bit score lies above a given threshold (which depends on the length of the reads) and is within, say, 10% of the best score seen for that read. The rationale for using protein reference sequences, rather than DNA reference sequences, is that current DNA reference databases cover only a small fraction of the true diversity of genomes that exist in the environment.
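The naive LCA assignment described above can be illustrated with a small sketch. The taxonomy below is a toy child-to-parent map rather than the NCBI taxonomy, and the function names, the default minimum bit score of 50 and the hit format are assumptions made for the example; they are not MEGAN's interface or defaults.

```python
# Toy taxonomy as a child -> parent map (hypothetical, for illustration only).
TOY_PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Bacillus subtilis": "Bacillus",
    "Bacillus": "Bacteria",
    "Bacteria": "root",
}


def lineage(taxon, parent):
    """Path from the root down to the given taxon."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def naive_lca(taxa, parent):
    """Lowest node that lies above (or equals) every taxon in `taxa`."""
    lineages = [lineage(t, parent) for t in taxa]
    lca = None
    for nodes in zip(*lineages):  # walk all lineages from the root in parallel
        if all(node == nodes[0] for node in nodes):
            lca = nodes[0]
        else:
            break
    return lca


def assign_read(hits, parent, min_score=50.0, top_fraction=0.10):
    """Assign one read from its (taxon, bit_score) hits.

    A hit counts as significant if its score is at least `min_score` and
    within `top_fraction` of the best score for the read.
    """
    if not hits:
        return None
    best = max(score for _, score in hits)
    significant = {
        taxon
        for taxon, score in hits
        if score >= min_score and score >= best * (1.0 - top_fraction)
    }
    return naive_lca(significant, parent) if significant else None


if __name__ == "__main__":
    hits = [("Escherichia coli", 200.0), ("Salmonella enterica", 190.0),
            ("Bacillus subtilis", 90.0)]
    print(assign_read(hits, TOY_PARENT))
```

In this example the Bacillus hit falls outside the top 10% of scores, so the read is placed on the lowest node shared by the two remaining taxa. A read with hits to distant taxa ends up on a high-level node, which is how the approach trades specificity for caution.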

Phylopythia

Phylopythia is a supervised classifier developed by researchers at IBM labs; it is essentially a support vector machine trained with DNA k-mers from known sequences.
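To make the idea of a support vector machine trained on k-mers concrete, here is a deliberately small sketch: sequences are turned into dinucleotide (k = 2) frequency vectors, and a two-class linear SVM is fitted by subgradient descent on the regularized hinge loss. The choice of k, the linear kernel, the learning-rate and regularization values and the toy training sequences are all assumptions for illustration; this does not reproduce Phylopythia's actual features, kernel or training procedure.

```python
from itertools import product

K = 2  # assumed k-mer length for this toy example
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_features(seq):
    """Normalized k-mer frequency vector of a DNA sequence."""
    seq = seq.upper()
    counts = [0.0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        j = INDEX.get(seq[i:i + K])
        if j is not None:  # ignore k-mers with ambiguous symbols
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def train_linear_svm(samples, labels, epochs=200, lr=0.1, lam=0.01):
    """Fit weights w and bias b for labels in {-1, +1} with hinge loss."""
    w = [0.0] * len(KMERS)
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi * (1.0 - lr * lam) for wi in w]  # L2 shrinkage
            if margin < 1.0:  # sample violates the margin: move toward it
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(seq, w, b):
    """Return +1 or -1 for a new sequence."""
    score = sum(wi * xi for wi, xi in zip(w, kmer_features(seq))) + b
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Made-up training fragments: +1 = AT-rich source, -1 = GC-rich source.
    train = [("ATATTAATTATAAATT", 1), ("TTAATATATAAATATT", 1),
             ("GCGCCGGCGGCCGCGG", -1), ("CCGGCGCGGCCGCCGG", -1)]
    w, b = train_linear_svm([kmer_features(s) for s, _ in train],
                            [y for _, y in train])
    print(predict("AATTATATTAAT", w, b), predict("GGCCGCGCCGGC", w, b))
```

A practical classifier would need many classes, longer k-mers and far more training data; the sketch only shows how known sequences supply labeled k-mer vectors from which a decision boundary is learned.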

SOrt-ITEMS

SOrt-ITEMS is an alignment-based binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., India. Users first perform a similarity search of the input metagenomic sequences (reads) against the nr protein database using BLASTx. The BLASTx output is then taken as input by the SOrt-ITEMS program. The method uses a range of BLAST alignment parameter thresholds to first identify an appropriate taxonomic level (or rank) at which the read can be assigned. An orthology-based approach is then adopted for the final assignment of the metagenomic read. Other alignment-based binning algorithms developed by the Innovation Labs of Tata Consultancy Services (TCS) include DiScRIBinATE, ProViDE and SPHINX. The methodologies of these algorithms are summarized below.

DiScRIBinATE

DiScRIBinATE is an alignment-based binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., India. DiScRIBinATE replaces the orthology approach of SOrt-ITEMS with a quicker 'alignment-free' approach. Incorporating this alternate strategy was observed to reduce the binning time by half without any significant loss in the accuracy and specificity of assignments. In addition, a novel reclassification strategy incorporated in DiScRIBinATE was seen to reduce the overall misclassification rate.

ProViDE

ProViDE is an alignment-based binning approach developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd. for the estimation of viral diversity in metagenomic samples. ProViDE adopts a reverse-orthology-based approach similar to that of SOrt-ITEMS for the taxonomic classification of metagenomic sequences obtained from virome datasets. It uses a customized set of BLAST parameter thresholds, specifically suited to viral metagenomic sequences. These thresholds capture the pattern of sequence divergence and the non-uniform taxonomic hierarchy observed within and across various taxonomic groups of the viral kingdom.

PCAHIER

PCAHIER, another binning algorithm, developed at the Georgia Institute of Technology, employs n-mer oligonucleotide frequencies as features and adopts a hierarchical classifier for binning short metagenomic fragments. Principal component analysis was used to reduce the high dimensionality of the feature space. The effectiveness of PCAHIER was demonstrated through comparisons against a non-hierarchical classifier and two existing binning algorithms (TETRA and Phylopythia).

SPHINX

SPHINX, another binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., adopts a hybrid strategy that achieves high binning efficiency by utilizing the principles of both 'composition'- and 'alignment'-based binning algorithms. The approach was designed with the objective of analyzing metagenomic datasets as rapidly as composition-based approaches, but nevertheless with the accuracy and specificity of alignment-based algorithms. SPHINX was observed to classify metagenomic sequences as rapidly as composition-based algorithms. In addition, the binning efficiency (in terms of accuracy and specificity of assignments) of SPHINX was observed to be comparable with results obtained using alignment-based algorithms.

INDUS and TWARIT

INDUS and TWARIT are other composition-based binning algorithms developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd. These algorithms utilize a range of oligonucleotide compositional (as well as statistical) parameters to improve binning time while maintaining the accuracy and specificity of taxonomic assignments.
