Metatranscriptomics is the set of techniques used to study gene expression of microbes within natural environments, i.e., the metatranscriptome. While metagenomics focuses on studying the genomic content and on identifying which microbes are present within a community, metatranscriptomics can be used to study the diversity of the active genes within such a community, to quantify their expression levels and to monitor how these levels change in different conditions (e.g., physiological vs. pathological conditions in an organism). The advantage of metatranscriptomics is that it can provide information about differences in the active functions of microbial communities that would otherwise appear to have similar make-up.
Introduction
The microbiome has been defined as a microbial community occupying a well-defined habitat. These communities are ubiquitous and can play a key role in maintenance of the characteristics of their environment, and an imbalance in these communities can negatively affect the activities of the setting in which they reside. To study these communities, and to then determine their impact and correlation with their niche, different omics approaches have been used. While metagenomics can help researchers generate a ''taxonomic'' profile of the sample, metatranscriptomics provides a ''functional'' profile by analysing which genes are expressed by the community. It is possible to infer what genes are expressed under specific conditions, and this can be done using functional annotations of expressed genes.
Function
Since metatranscriptomics focuses on what genes are expressed, it enables the characterization of the active functional profile of the entire microbial community. The overview of the gene expression in a given sample is obtained by capturing the total mRNA of the microbiome and performing whole-metatranscriptomics shotgun sequencing.
Tools and techniques
Although microarrays can be exploited to determine the gene expression profiles of some model organisms, next-generation sequencing and third-generation sequencing are the preferred techniques in metatranscriptomics. The protocol that is used to perform a metatranscriptome analysis may vary depending on the type of sample that needs to be analysed. Indeed, many different protocols have been developed for studying the metatranscriptome of microbial samples. Generally, the steps include sample harvesting, RNA extraction (different extraction methods for different kinds of samples have been reported in the literature), mRNA enrichment, cDNA synthesis and preparation of metatranscriptomic libraries, sequencing and data processing and analysis. mRNA enrichment is one of the most technically challenging steps, for which different strategies have been proposed:
* removing rRNA through ribosomal RNA capture
* using a 5′–3′ exonuclease to degrade processed RNAs (mostly rRNA and tRNA)
* adding poly(A) to mRNAs by using a poly(A) polymerase (in E. coli)
* using antibodies to capture mRNAs that bind to specific proteins
The last two strategies are not recommended as they have been reported to be highly biased.
Computational analysis
A typical metatranscriptome analysis pipeline:
*maps reads to a reference genome, or
*performs de novo assembly of the reads into transcript contigs and supercontigs
The first strategy maps reads to reference genomes in databases, to collect information that is useful to deduce the relative expression of individual genes. Metatranscriptomic reads are mapped against databases using alignment tools, such as Bowtie2, BWA, and BLAST. Then, the results are annotated using resources such as GO, KEGG, COG, and Swiss-Prot. The final analysis of the results is carried out depending on the aim of the study. One of the more recent metatranscriptomics techniques is stable isotope probing (SIP), which has been used to retrieve specific targeted transcriptomes of aerobic microbes in lake sediment.
The limitation of the mapping strategy is its reliance on the information of reference genomes in databases.
The second strategy retrieves the abundance in the expression of the different genes by assembling metatranscriptomic reads into longer fragments called contigs using different software. The Trinity software for RNA-seq, in comparison with other de novo transcriptome assemblers, was reported to recover more full-length transcripts over a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. This is particularly important in the absence of a reference genome.
A quantitative pipeline for transcriptomic analysis was developed by Li and Dewey and called RSEM (RNA-Seq by Expectation Maximization). It can work as stand-alone software or as a plug-in for Trinity. RSEM starts with a reference transcriptome or assembly along with RNA-Seq reads generated from the sample and calculates normalized transcript abundance (meaning the number of RNA-Seq reads corresponding to each transcript in the reference transcriptome or assembly).
Although both Trinity and RSEM were designed for transcriptomic datasets (i.e., obtained from a single organism), it may be possible to apply them to metatranscriptomic data (i.e., obtained from a whole microbial community).
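As a rough illustration of what a normalized transcript abundance is, the sketch below computes transcripts per million (TPM) from read counts and transcript lengths. It is not RSEM's algorithm: RSEM uses expectation maximization to distribute ambiguously mapping reads, whereas this example assumes that each read has already been assigned to one transcript. The function name and the toy values are hypothetical.

```python
def tpm(counts, lengths):
    """Return transcripts-per-million (TPM) for each transcript.

    counts  -- dict mapping transcript ID to number of assigned reads
    lengths -- dict mapping transcript ID to transcript length in bases
    """
    # Reads per kilobase: corrects for long transcripts attracting more reads.
    rpk = {t: counts[t] / (lengths[t] / 1000.0) for t in counts}
    total = sum(rpk.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    # Rescale so that the values of one sample sum to one million.
    return {t: value / total * 1e6 for t, value in rpk.items()}


# Made-up counts and lengths, for illustration only.
example_tpm = tpm(
    {"geneA": 100, "geneB": 300, "geneC": 0},
    {"geneA": 1000, "geneB": 3000, "geneC": 500},
)
```

In this toy case geneB has three times the reads of geneA but is also three times as long, so both receive the same TPM value.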
Bioinformatics
The use of computational analysis tools has become more important as DNA sequencing capabilities have grown, particularly in metagenomic and metatranscriptomic analysis, which can generate a huge volume of data. Many different bioinformatic pipelines have been developed for these purposes, often as open source platforms such as HUMAnN and the more recent HUMAnN2, MetaTrans, SAMSA, Leimena-2013 and mOTUs2.
HUMAnN2
HUMAnN2 is a bioinformatic pipeline designed from the previous HUMAnN software, which was developed during the Human Microbiome Project (HMP), implementing a “tiered search” approach. In the first tier, HUMAnN2 screens DNA or RNA reads with MetaPhlAn2 in order to identify already-known microbes and constructs a sample-specific database by merging pangenomes of annotated species; in the second tier, the algorithm performs a mapping of the reads against the assembled pangenome database; in the third tier, non-aligned reads are used for a translated search against a protein database.
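The control flow of such a tiered search can be sketched in a few lines. The example below is a toy model under strong assumptions and not HUMAnN2's implementation: exact substring matching stands in for read alignment, a dictionary of marker sequences stands in for MetaPhlAn2, and `protein_search` is a hypothetical callable standing in for the translated search. All names are illustrative.

```python
def tiered_search(reads, species_markers, pangenomes, protein_search):
    """Toy three-tier search; substring matching stands in for alignment.

    reads           -- list of nucleotide strings
    species_markers -- dict: species -> marker sequence (taxonomic screen)
    pangenomes      -- dict: species -> dict of gene ID -> gene sequence
    protein_search  -- callable(read) -> protein family ID, or None
    """
    # Tier 1: detect known species, then merge their pangenomes into a
    # sample-specific database.
    detected = {sp for sp, marker in species_markers.items()
                if any(r in marker for r in reads)}
    sample_db = {}
    for sp in sorted(detected):
        for gene_id, seq in pangenomes.get(sp, {}).items():
            sample_db[(sp, gene_id)] = seq

    # Tier 2: map every read against the sample-specific database.
    nucleotide_hits, unaligned = {}, []
    for r in reads:
        hit = next((key for key, seq in sample_db.items() if r in seq), None)
        if hit is None:
            unaligned.append(r)
        else:
            nucleotide_hits[r] = hit

    # Tier 3: reads left over go to the (slower) translated search.
    translated_hits, unmapped = {}, []
    for r in unaligned:
        family = protein_search(r)
        if family is None:
            unmapped.append(r)
        else:
            translated_hits[r] = family
    return nucleotide_hits, translated_hits, unmapped
```

The point of the tiers is that the expensive translated search is only run on reads that the cheaper nucleotide-level mapping could not place.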
MetaTrans
MetaTrans is a pipeline that exploits multithreading to improve efficiency. Data is obtained from paired-end RNA-Seq, mainly from 16S rRNA for taxonomy and mRNA for gene expression levels. The pipeline is divided into four major steps. First, paired-end reads are filtered for quality control purposes, then sorted and filtered for taxonomic analysis (by removal of tRNA sequences) or functional analysis (by removal of both tRNA and rRNA reads). For the taxonomic analysis, sequences are mapped against the 16S rRNA Greengenes v13.5 database using SOAP2, while for functional analysis sequences are mapped against a functional database such as MetaHIT-2014, again using SOAP2. This pipeline is highly flexible, since it offers the possibility to use third-party tools and improve single modules as long as the general structure is preserved.
SAMSA
This pipeline is designed specifically for metatranscriptomics data analysis, by working in conjunction with the MG-RAST server for metagenomics. This pipeline is simple to use, requires little technical preparation or computational power, and can be applied to a wide range of microbes. First, sequences from raw sequencing data are filtered for quality and then submitted to MG-RAST (which performs further steps such as quality control, gene calling, clustering of amino acid sequences and use of sBLAT on each cluster to detect the best matches). Matches are then aggregated for taxonomic and functional analysis purposes.
Leimena-2013
This pipeline does not have an official name and is usually referred to using the first author of the article in which it is described. It relies on alignment tools such as BLAST and MegaBLAST. Reads are clustered in groups of identical sequences and then processed for in-silico removal of tRNA and rRNA sequences. Remaining reads are then mapped to NCBI databases using BLAST and MegaBLAST, then classified by their bitscore. Sequences with higher bitscores are used to predict phylogenetic origin and function, while lower-score reads are aligned with the more sensitive BLASTX and can then be aligned against protein databases so that their function can be characterized.
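Two of the steps described above, collapsing identical reads and routing reads by bitscore, can be sketched as follows. This is a minimal illustration and not the published pipeline: the function names are hypothetical, and the bitscore cut-off is an arbitrary placeholder because the text above does not state the value used.

```python
from collections import Counter


def cluster_identical(reads):
    """Collapse identical sequences; returns sequence -> number of copies."""
    return Counter(reads)


def split_by_bitscore(best_scores, threshold):
    """Route reads according to the bitscore of their best nucleotide hit.

    best_scores -- dict: read sequence -> best bitscore, or None if no hit
    threshold   -- placeholder cut-off (an assumption, not from the source)
    """
    confident, needs_protein_search = [], []
    for read, score in best_scores.items():
        if score is not None and score >= threshold:
            confident.append(read)             # used for taxonomy and function
        else:
            needs_protein_search.append(read)  # passed on to a BLASTX-like step
    return confident, needs_protein_search


clusters = cluster_identical(["AAA", "AAA", "CCC", "GGG"])
confident, remaining = split_by_bitscore(
    {"AAA": 180.0, "CCC": 45.0, "GGG": None}, threshold=100.0
)
```

Clustering first means that each distinct sequence is aligned only once, with the copy number kept for later quantification.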
mOTUs2
The mOTUs2 profiler, which is based on essential housekeeping genes, is well suited for quantification of basal transcriptional activity of microbial community members. Depending on environmental conditions, the number of transcripts per cell varies for most genes. An exception to this are housekeeping genes, which are expressed constitutively and with low variability under different conditions. Thus, the abundance of transcripts from such genes strongly correlates with the abundance of active cells in a community.
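The idea that housekeeping-gene transcripts track the number of active cells can be illustrated with a small calculation. The sketch below is not the mOTUs2 implementation; it simply assumes that, for each taxon, read counts are available for several housekeeping marker genes, summarizes them by their median, and rescales the result to relative activity. The function name and the numbers are hypothetical.

```python
from statistics import median


def relative_activity(marker_counts):
    """Estimate each taxon's share of basal transcriptional activity.

    marker_counts -- dict: taxon -> list of read counts, one per marker gene
    """
    # The median across marker genes is robust to a single outlying gene.
    per_taxon = {taxon: median(counts) for taxon, counts in marker_counts.items()}
    total = sum(per_taxon.values())
    if total == 0:
        return {taxon: 0.0 for taxon in per_taxon}
    return {taxon: value / total for taxon, value in per_taxon.items()}


# Made-up counts for two taxa and four marker genes each.
activity = relative_activity({
    "taxonA": [30, 28, 35, 31],
    "taxonB": [10, 9, 12, 11],
})
```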
Microarrays
Another method that can be exploited for metatranscriptomic purposes is tiling microarrays. In particular, microarrays have been used to measure microbial transcription levels, to detect new transcripts and to obtain information about the structure of mRNAs (for instance, the UTR boundaries). More recently, they have also been used to find new regulatory ncRNAs. However, microarrays are affected by some pitfalls:
*requirement of probe design
*low sensitivity
*prior knowledge of gene targets.
RNA-Seq can overcome these limitations: it does not require any previous knowledge about the genomes that have to be analysed and it provides high-throughput validation of gene prediction, structure and expression. Thus, by combining the two approaches it is possible to obtain a more complete representation of the bacterial transcriptome.
Limitations
*With its dominating abundance, ribosomal RNA strongly reduces the coverage of mRNA (usually the main focus of transcriptomic studies) in the total collected RNA.
*Extraction of high-quality RNA from some biological or environmental samples (such as feces) can be difficult.
*Instability of mRNA can compromise sample integrity even before sequencing.
*Experimental issues can affect the quantification of differences in expression among multiple samples: they can influence the integrity and amount of input RNA, as well as the amount of rRNA remaining in the samples, size selection and gene models. Moreover, molecular-based techniques are very prone to artefacts.
*Difficulties in differentiating between host and microbial RNA, although commercial kits for microbial enrichment are available. This may also be done in silico if a reference genome is available for the host.
*Transcriptome reference databases are limited in their coverage.
*Generally, large populations of cells are exploited in metatranscriptomic analysis, so it is difficult to resolve important variances that can exist between subpopulations. High variability in pathogen populations was demonstrated to affect disease progression and virulence.
*Both for microarray and RNA-Seq, it is difficult to set a real threshold to classify genes as “expressed”, due to the high dynamic range in gene expression.
*The presence of mRNA is not always associated with the actual presence of the respective protein.
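The in-silico separation of host and microbial RNA mentioned in the list above can be sketched as a filter on alignment results. The example assumes that single-end reads have already been aligned to the host reference genome by some aligner that writes SAM output, and it keeps only reads flagged as unmapped (bit 0x4 of the SAM FLAG field). The function name is hypothetical, and real workflows usually also handle paired reads and low-quality alignments.

```python
def microbial_read_ids(sam_lines):
    """Return IDs of reads that did not align to the host reference.

    sam_lines -- iterable of SAM-format text lines from a host alignment
    """
    UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"
    kept = []
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        read_id, flag = fields[0], int(fields[1])
        if flag & UNMAPPED:
            kept.append(read_id)  # no host hit: candidate microbial read
    return kept
```

Reads retained by this filter are then passed on to the microbial mapping or assembly steps.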
Applications
Human gut microbiome
The gut microbiome has emerged in recent years as an important player in human health. Its prevalent functions are related to the fermentation of indigestible food components, competition with pathogens, strengthening of the intestinal barrier, and stimulation and regulation of the immune system.
Although much has been learnt about the microbiome community in recent years, the wide diversity of microorganisms and molecules in the gut requires new tools to enable new discoveries. By focusing on changes in the expression of the genes, metatranscriptomics can generate a more dynamic picture of the state and activity of the microbiome than metagenomics. It has been observed that metatranscriptomic functional profiles are more variable than might have been expected from metagenomic information alone. This suggests that non-housekeeping genes are not stably expressed in situ.
One example of metatranscriptomic application is in the study of the gut microbiome in inflammatory bowel disease. Inflammatory bowel disease (IBD) is a group of chronic diseases of the digestive tract that affects millions of people worldwide.
Several human genetic mutations have been linked to an increased susceptibility to IBD, but additional factors are needed for the full development of the disease.
Regarding the relationship between IBD and the gut microbiome, it is known that there is a dysbiosis in patients with IBD, but microbial taxonomic profiles can be highly different among patients, making it difficult to implicate specific microbial species or strains in disease onset and progression. In addition, the gut microbiome composition presents a high variability over time among people, with more pronounced variations in patients with IBD.
The functional potential of an organism, meaning the genes and pathways encoded in its genome, provides only indirect information about the level or extent of activation of such functions. So, the measurement of functional activity (gene expression) is critical to understand the mechanism of the gut microbiome dysbiosis.
Alterations in transcriptional activity in IBD, established on the basis of rRNA expression, indicate that some bacterial populations are active in patients with IBD, while other groups are inactive or latent.
A metatranscriptomics analysis measuring the functional activity of the gut microbiome reveals insights only partially observable in metagenomic functional potential, including disease-linked observations for IBD. It has been reported that many IBD-specific signals are either more pronounced or only detectable on the RNA level.
These altered expression profiles are potentially the result of changes in the gut environment in patients with IBD, which include increased levels of inflammation, higher concentrations of oxygen and a diminished mucous layer.
Metatranscriptomics has the advantage of allowing researchers to skip the assaying of biochemical products in situ (like mucus or oxygen) and enables evaluation of effects of environmental changes on microbial expression patterns in vivo for large human populations. In addition, it can be coupled with longitudinal sampling to associate modulation of activity with the disease progression. Indeed, it has been shown that while a particular pathway may remain stable over time at the genomic level, the corresponding expression varies with the disease severity. This suggests that microbial dysbiosis affects gut health through changes in the transcriptional programmes of a stable community. In this way, metatranscriptomic profiling emerges as an important tool for understanding the mechanisms of that relationship.
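One simple way to express the contrast between genomic stability and variable expression is a per-pathway ratio of relative RNA abundance to relative DNA abundance. The sketch below is an illustration under assumptions, not a method taken from the studies summarized here: it supposes paired metagenomic and metatranscriptomic abundance tables for the same sample, and the function name, pseudocount and values are hypothetical.

```python
def expression_ratio(dna_abundance, rna_abundance, pseudocount=1e-6):
    """Per-pathway ratio of relative RNA abundance to relative DNA abundance.

    A ratio near 1 means expression in line with gene copy number; values
    above or below 1 suggest over- or under-expression.
    """
    dna_total = sum(dna_abundance.values())
    rna_total = sum(rna_abundance.values())
    if dna_total == 0 or rna_total == 0:
        raise ValueError("abundance tables must not be empty or all zero")
    ratios = {}
    for pathway, dna_value in dna_abundance.items():
        dna_rel = dna_value / dna_total
        rna_rel = rna_abundance.get(pathway, 0.0) / rna_total
        # The pseudocount avoids division by zero for rare pathways.
        ratios[pathway] = (rna_rel + pseudocount) / (dna_rel + pseudocount)
    return ratios


# Same made-up DNA profile at two time points, different RNA profiles.
dna = {"pathway1": 50, "pathway2": 50}
ratio_mild = expression_ratio(dna, {"pathway1": 50, "pathway2": 50})
ratio_severe = expression_ratio(dna, {"pathway1": 90, "pathway2": 10})
```

In the toy data the DNA profile is identical at both time points, while the ratio for pathway1 rises from about 1 to about 1.8.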
Some technical limitations of RNA measurements in stool are related to the fact that the extracted RNA can be degraded and, even when it is not, it still represents only the organisms present in the stool sample.
Other
*Directed culturing: has been used to understand nutritional preferences of organisms in order to allow the preparation of a proper culture medium, resulting in a successful isolation of microbes in vitro.
*Identify potential virulence factors: through comparative transcriptomics, in order to compare different transcriptional responses of related strains or species after specific stimuli.
*Identify host-specific biological processes and interactions: for this purpose, it is important to develop new technologies which allow the detection, at the same time, of changes in the expression levels of some genes.
Examples of techniques applied:
Microarrays: allow the monitoring of changes in the expression levels of many genes in parallel for both host and pathogen. The first microarray approaches provided the first global analyses of gene expression changes in pathogens such as Vibrio cholerae, Borrelia burgdorferi, Chlamydia trachomatis, Chlamydia pneumoniae and Salmonella enterica, revealing the strategies that are used by these microorganisms to adapt to the host.
In addition, microarrays provided the first global insights into the host innate immune response to PAMPs, as well as the effects of bacterial infection on the expression of various host factors.
However, the detection of both organisms at the same time through microarrays can be problematic.
Problems:
*Probe selection (hundreds of millions of different probes)
*Cross-hybridization
*Need of expensive chips (with the proper design; high-density arrays)
*Require the pathogen and host cells to be physically separated before gene expression analysis (eukaryotic cells’ transcriptomes are larger than those of pathogens, so the signal from the pathogens’ RNAs may be hidden).
*Loss of RNA molecules during eukaryotic cell lysis.
Dual RNA-Seq: this technique allows the simultaneous study of both host and pathogen transcriptomes. It is possible to monitor the expression of genes at different time points of the infection process; in this way it is possible to study the changes in cellular networks in both organisms, from the initial contact until the manipulation of the host (host-pathogen interplay).
Potential:
*No need of expensive chips
*Probe-independent approach (RNA-seq provides transcript information without prior knowledge of mRNA sequences)
*High sensitivity.
*Possibility of studying the expression levels of even unknown genes under different conditions
Moreover, RNA-Seq is an important approach for identifying coregulated genes, enabling the organization of pathogen genomes into operons. Indeed, genome annotation has been done for some eukaryotic pathogens, such as Candida albicans, Trypanosoma brucei and Plasmodium falciparum.
Despite the increasing sensitivity and depth of sequencing now available, there are still few published RNA-Seq studies concerning the response of the mammalian host cell to the infection.