PICRUSt

PICRUSt is a bioinformatics software package. The name is an abbreviation for "Phylogenetic Investigation of Communities by Reconstruction of Unobserved States". The tool is used in metagenomic analysis, where it infers the functional profile of a microbial community from a marker gene survey of one or more samples. In essence, PICRUSt takes a user-supplied operational taxonomic unit table (typically referred to as an OTU table), which lists the marker gene sequence clusters (most commonly 16S rRNA gene clusters) together with their relative abundance in each sample. The output of PICRUSt is a sample-by-functional-gene count matrix giving the count of each functional gene in each of the samples surveyed. PICRUSt's ability to estimate the functional gene profile of a sample relies on a set of known sequenced genomes. It can also be thought of as an automated alternative to manually researching the gene families likely to be present in organisms whose sequences are found in a 16S ribosomal RNA amplicon library. The description below corresponds to the original version of PICRUSt; a major update to the tool is currently being developed.


Genome prediction algorithm

In an initial preprocessing phase, PICRUSt constructs confidence intervals and point predictions for the number of copies of each gene family in each bacterial and archaeal strain in a reference tree, using organisms with sequenced genomes as a reference. More specifically, for each gene family, PICRUSt maps known gene copy numbers (from completely sequenced genomes) onto a reference tree of life. These gene family copy numbers are treated as continuous traits, and an evolutionary model is constructed under the assumption of Brownian motion. These evolutionary models can be constructed with maximum likelihood, relaxed maximum likelihood, or Wagner parsimony. The evolutionary model is then used to predict both a point estimate and a confidence interval for the copy number in microorganisms without sequenced genomes. This 'genome prediction' step produces a large table of bacterial types (specifically operational taxonomic units, or OTUs) versus gene family copy numbers, which is distributed to end users. This prediction method is not the same as a nearest-neighbor approach (i.e. simply looking up the nearest sequenced genome), and it was shown to give a small but significant improvement in accuracy over that strategy. Nearest-neighbor prediction is nevertheless available as an option in PICRUSt. While this functionality is typically used to predict gene copy numbers in bacteria, it could in principle be used to predict any other continuous trait, given trait data for diverse organisms and a reference phylogeny. Langille et al. tested the accuracy of this genome prediction step using leave-one-out cross-validation on the input set of sequenced genomes. Additional tests examined sensitivity to errors in phylogenetic inference, lack of genomic data, and the accuracy of the confidence intervals on gene content. A similar step predicts the copy number of 16S rRNA genes.
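
As a rough illustration of how a Brownian-motion prediction differs from a nearest-neighbor lookup, the sketch below predicts a copy number for an unsequenced organism attached at a node whose only neighbors are sequenced genomes. Under Brownian motion the variance of a trait grows in proportion to branch length, so the point estimate is a mean weighted by inverse branch length. This is a deliberately simplified, hypothetical example: the function names, copy numbers, and branch lengths are invented for illustration, and it is not PICRUSt's implementation, which reconstructs states across the whole reference tree and also produces confidence intervals.

```python
def brownian_point_estimate(neighbors):
    """Inverse-branch-length weighted mean of known trait values.

    neighbors: list of (copy_number, branch_length) pairs for sequenced
    genomes joined to the same node. Branch lengths must be positive.
    """
    weights = [1.0 / length for _, length in neighbors]
    weighted_sum = sum(value * w for (value, _), w in zip(neighbors, weights))
    return weighted_sum / sum(weights)


def nearest_neighbor_estimate(neighbors):
    """Copy number of the single closest sequenced genome."""
    value, _ = min(neighbors, key=lambda pair: pair[1])
    return value


# Hypothetical data: two sequenced relatives of an unsequenced organism.
relatives = [(2.0, 0.5), (5.0, 0.25)]  # (gene copy number, branch length)

bm_estimate = brownian_point_estimate(relatives)    # uses both relatives
nn_estimate = nearest_neighbor_estimate(relatives)  # uses only the closest
print(bm_estimate, nn_estimate)
```

With these made-up numbers the weighted estimate falls between the two known copy numbers, whereas the nearest-neighbor lookup simply copies the value of the closest genome.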


Metagenome prediction algorithm

When applying PICRUSt to a 16S rRNA gene library, PICRUSt matches reference operational taxonomic units against the precomputed tables and retrieves, for each OTU, a predicted 16S rRNA copy number and a predicted copy number for each gene family. The abundance of each OTU is divided by its predicted 16S rRNA copy number (if a bacterium has multiple 16S copies, its apparent abundance in 16S rRNA data will be inflated) and then multiplied by the copy number of the gene family. This gives a prediction for the contribution of each OTU to the overall gene content of the sample (the metagenome). Finally, these individual contributions are summed to produce an estimate of the genes present in the metagenome. Langille et al. (2013) tested the accuracy of this metagenome prediction step using previously reported datasets in which the same biological sample was subjected to both 16S rRNA gene amplification and shotgun metagenomics. In these cases, the shotgun metagenomic results were taken as a representation of the 'true' community, and the 16S rRNA gene amplicon libraries were fed into PICRUSt to attempt to predict those data. Test datasets included human microbiome samples from the Human Microbiome Project, soil samples, diverse mammalian samples, and samples from the Guerrero Negro microbial mats.
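
The normalize, multiply, and sum steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PICRUSt's own code: the function name, the dictionary layout, and the OTU and gene family labels are all hypothetical, and OTUs missing from the precomputed tables are simply skipped.

```python
def predict_metagenome(otu_counts, marker_copies, gene_copies):
    """Predict per-sample gene family counts from an OTU table.

    otu_counts:    {sample: {otu: observed 16S abundance}}
    marker_copies: {otu: predicted 16S rRNA copy number}
    gene_copies:   {otu: {gene_family: predicted copy number}}
    """
    metagenome = {}
    for sample, counts in otu_counts.items():
        totals = {}
        for otu, count in counts.items():
            if otu not in marker_copies or otu not in gene_copies:
                continue  # assumption: OTUs without predictions are ignored
            # Correct the apparent abundance for multiple 16S copies.
            normalized = count / marker_copies[otu]
            # Add this OTU's contribution to each gene family total.
            for family, copies in gene_copies[otu].items():
                totals[family] = totals.get(family, 0.0) + normalized * copies
        metagenome[sample] = totals
    return metagenome


# Hypothetical inputs: one sample, two OTUs, two gene families.
otu_counts = {"sample1": {"OTU_a": 40, "OTU_b": 10}}
marker_copies = {"OTU_a": 4.0, "OTU_b": 1.0}
gene_copies = {
    "OTU_a": {"geneA": 2.0, "geneB": 0.0},
    "OTU_b": {"geneA": 1.0, "geneB": 3.0},
}

predicted = predict_metagenome(otu_counts, marker_copies, gene_copies)
print(predicted)  # {'sample1': {'geneA': 30.0, 'geneB': 30.0}}
```

In this toy example OTU_a looks four times as abundant as OTU_b in the raw 16S data, but after dividing by its four 16S copies both OTUs have the same normalized abundance (10), so each contributes in proportion to its gene copy numbers alone.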


The Nearest Sequenced Taxon Index

Because PICRUSt, and evolutionary comparative genomics in general, depends on sequenced genomes, biological samples from well-studied environments (with many sequenced genomes) will be better predicted than samples from poorly studied environments. To assess how well a sample is covered by available genomes, PICRUSt optionally allows users to calculate a Nearest Sequenced Taxon Index (NSTI) for their samples. This index reflects the average phylogenetic distance between each 16S rRNA gene sequence in the sample and the nearest 16S rRNA gene sequence from a fully sequenced genome. In general, the lower the NSTI score, the more accurate PICRUSt's predictions are expected to be. For example, Langille et al. showed that PICRUSt was much more accurate on diverse soil samples and samples from the Human Microbiome Project than on microbial mat samples from Guerrero Negro, which contained many bacteria without any sequenced relatives.
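
An index of this kind can be sketched as an average of per-OTU distances to the nearest sequenced genome. The example below assumes the average is weighted by each OTU's abundance in the sample and that the distances have already been computed from a reference tree; the function name and all values are hypothetical and only illustrate the arithmetic.

```python
def nearest_sequenced_taxon_index(abundances, nearest_distances):
    """Abundance-weighted mean distance to the nearest sequenced genome.

    abundances:        {otu: abundance in the sample}
    nearest_distances: {otu: phylogenetic distance to the closest
                        reference genome (assumed precomputed)}
    """
    total = sum(abundances.values())
    if total == 0:
        raise ValueError("sample has no reads")
    weighted = sum(
        abundance * nearest_distances[otu]
        for otu, abundance in abundances.items()
    )
    return weighted / total


# Hypothetical sample: most reads come from a well-characterized OTU.
abundances = {"OTU_a": 30, "OTU_b": 10}
nearest_distances = {"OTU_a": 0.02, "OTU_b": 0.10}

nsti = nearest_sequenced_taxon_index(abundances, nearest_distances)
print(round(nsti, 4))
```

A sample dominated by OTUs with close sequenced relatives yields a low score, while a sample rich in organisms without sequenced relatives, such as the microbial mats mentioned above, would yield a higher one.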


Related tools

Okuda et al. (2012) published a similar method that used a bounded k-nearest neighbor approach to predict virtual metagenomes. They validated their approach using 16S rRNA gene sequences extracted from shotgun metagenomes and compared the predictions of their method against the full metagenome. CopyRighter, like PICRUSt, uses evolutionary modeling and phylogenetic trait prediction to estimate 16S rRNA gene copy numbers for each bacterial and archaeal type in a sample, and then uses these estimates to correct estimates of community composition. PanFP is a similar method, but it is based on genome predictions for each taxonomic group. Benchmarking showed performance highly similar to PICRUSt when the two were compared on the same datasets. One advantage of PanFP is that all OTUs can be used, not just those in a reference phylogeny table; one disadvantage is that confidence intervals and evolutionary models are not constructed. PAPRICA is a metagenome prediction tool based on placing input 16S rRNA gene sequences into a known phylogenetic tree corresponding to reference genomes. Its main prediction output corresponds to Enzyme Commission numbers. Piphillin is a tool produced by the company Second Genome that generates metagenome predictions based on nearest-neighbor clustering of input 16S rRNA gene sequences with 16S rRNA gene sequences from reference genomes. There is a web portal for running this tool on the Second Genome website. The tool is under continual development and validation, as summarized in a 2020 publication. Tax4Fun is a similar tool based on linking the 16S ribosomal RNA genes from all KEGG organisms with 16S rRNA gene sequences found in the SILVA ribosomal RNA database. Originally this tool was restricted to 16S rRNA gene sequences found within the SILVA database. However, the latest version, Tax4Fun2, can be used with OTUs or amplicon sequence variants from any clustering pipeline.

