Paired-end tags (PET) (sometimes "Paired-End diTags", or simply "ditags") are the short sequences at the 5' and 3' ends of a DNA fragment which are unique enough that they (theoretically) exist together only once in a genome, therefore making the sequence of the DNA in between them available upon search (if full-genome sequence data is available) or upon further sequencing (since tag sites are unique enough to serve as primer annealing sites). Paired-end tags exist in PET libraries with the intervening DNA absent; that is, a PET "represents" a larger fragment of genomic DNA or cDNA by consisting of a short 5' linker sequence, a short 5' sequence tag, a short 3' sequence tag, and a short 3' linker sequence. It was shown conceptually that 13 base pairs are sufficient to map tags uniquely.
[Fullwood MJ, Wei CL, Liu ET, Ruan Y. 2009. Next-Generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Research. 19:521–532. PMID 19339662] However, longer sequences are more practical for mapping reads uniquely. The endonucleases (discussed below) used to produce PETs give longer tags (18/20 base pairs and 25/27 base pairs), but sequences of 50–100 base pairs would be optimal for both mapping and cost efficiency.
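Why pairing two short tags helps uniqueness can be illustrated with a back-of-envelope estimate. The sketch below assumes a deliberately simple model that is not taken from the cited work: a genome of independent, equally likely bases, searched on one strand only. The function names, the genome size, and the window size are all hypothetical choices for illustration; real genomes contain repeats, so chance matches are more frequent than this model suggests.

```python
def expected_single_tag_hits(genome_size: int, tag_len: int) -> float:
    """Expected number of chance occurrences of one specific tag.

    Assumes uniform random bases: each position matches with
    probability 1 / 4**tag_len.
    """
    return genome_size / 4 ** tag_len


def expected_pair_hits(genome_size: int, tag_len: int, window: int) -> float:
    """Expected number of chance co-occurrences of two specific tags.

    The second tag must start within `window` bases downstream of the
    first, which mimics the size limit of the original fragment.
    """
    first_tag_hits = expected_single_tag_hits(genome_size, tag_len)
    second_tag_probability = window / 4 ** tag_len
    return first_tag_hits * second_tag_probability


if __name__ == "__main__":
    genome = 3_000_000_000  # assumed genome size, roughly human-sized
    window = 10_000         # assumed maximum fragment length in bases
    for k in (13, 18, 20, 25, 27):
        single = expected_single_tag_hits(genome, k)
        pair = expected_pair_hits(genome, k, window)
        print(f"{k:>2} bp: single tag {single:.3g}, tag pair {pair:.3g}")
```

Under this model the expected number of chance matches falls by a factor of four for every added base, and requiring both tags to map near each other lowers it much further than either tag does alone.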
After the PETs are extracted from many DNA fragments, they are linked (concatenated) together for efficient sequencing. On average, 20–30 tags could be sequenced with the Sanger method, which has a longer read length. Since the tag sequences are short, individual PETs are well suited for next-generation sequencing, which has short read lengths and higher throughput. The main advantages of PET sequencing are its reduced cost from sequencing only short fragments, its detection of structural variants in the genome, and its increased specificity when aligning back to the genome compared to single tags, which involve only one end of the DNA fragment.
Constructing the PET library
PET libraries are typically prepared by one of two general methods: cloning based and cloning-free based.
Cloning based
Fragmented genomic DNA or complementary DNA (cDNA) of interest is cloned into plasmid vectors. The cloning sites are flanked with adaptor sequences that contain restriction sites for endonucleases (discussed below). Inserts are ligated to the plasmid vectors, and individual vectors are then transformed into ''E. coli'', making the PET library. PET sequences are obtained by purifying the plasmids and digesting them with a specific endonuclease, leaving two short sequences on the ends of the vectors. Under intramolecular (dilute) conditions, vectors are re-circularized and ligated, leaving only the ditags in the vector. The sequences unique to the clone are now paired together. Depending on the next-generation sequencing technique, PET sequences can be left singular, dimerized, or concatenated into long chains.
Cloning-free based
Instead of cloning, adaptors containing the endonuclease recognition sequence are ligated to the ends of fragmented genomic DNA or cDNA. The molecules are then self-circularized and digested with the endonuclease, releasing the PET.
Before sequencing, these PETs are ligated to adaptors to which PCR primers anneal for amplification.
The advantage of cloning-based construction of the library is that it keeps the fragments or cDNA intact for future use. However, the construction process is much longer than the cloning-free method. Variations on library construction have been produced by next-generation sequencing companies to suit their respective technologies.
Endonucleases
Unlike most restriction endonucleases, the MmeI (type IIS) and EcoP15I (type III) restriction endonucleases cut at a fixed distance downstream of their target binding sites. MmeI cuts 18/20 base pairs downstream and EcoP15I cuts 25/27 base pairs downstream. As these restriction enzymes bind at their target sequences located in the adaptors, they cut and release vectors that contain short sequences of the fragment or cDNA ligated to them, producing PETs.
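The net effect of this digestion and re-ligation on a single fragment can be sketched in a few lines. This is an illustration of the resulting ditag only, not a simulation of the enzymes: the function name `make_pet`, the default tag length, and the linker arguments are hypothetical, and the staggered 18/20 or 25/27 cut is simplified to a single tag length.

```python
def make_pet(fragment: str, tag_len: int = 20,
             linker5: str = "", linker3: str = "") -> str:
    """Return linker5 + 5' tag + 3' tag + linker3 for one fragment.

    The intervening DNA between the two tags is discarded, as in a
    PET library. `tag_len` stands in for the enzyme's cut distance
    (for example about 20 for MmeI or about 27 for EcoP15I).
    """
    if tag_len <= 0:
        raise ValueError("tag_len must be positive")
    fragment = fragment.upper()
    if len(fragment) < 2 * tag_len:
        raise ValueError("fragment is shorter than two tags")
    tag5 = fragment[:tag_len]    # short sequence kept from the 5' end
    tag3 = fragment[-tag_len:]   # short sequence kept from the 3' end
    return linker5 + tag5 + tag3 + linker3


if __name__ == "__main__":
    # Toy fragment: distinct ends with filler standing in for the insert.
    toy = "AAAAC" + "G" * 30 + "TTTTT"
    print(make_pet(toy, tag_len=5))
```

Because only the two end tags survive, the length of the result depends on the tag and linker lengths and not on the length of the original fragment.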
PET applications
#DNA-PET: Because PETs represent connectivity between the tags, the use of PET in genome re-sequencing has advantages over the use of single reads. This application is called pairwise end sequencing, known colloquially as ''double-barrel shotgun sequencing''. Anchoring one half of the pair uniquely to a single location in the genome allows the other half to be mapped even when it is ambiguous on its own. Ambiguous reads are those that map to more than a single location. This increased efficiency reduces the cost of sequencing, as these ambiguous sequences, or reads, would normally be discarded. The connectivity of PET sequences also allows detection of structural variations: insertions, deletions, duplications, inversions, and translocations.
During the construction of the PET library, the fragments can be selected to all be of a certain size. After mapping, the two tags of each PET are thus expected to lie a consistent, known distance away from each other. A discrepancy from this distance indicates a structural variation between the tags. For example (Figure on the right): a deletion in the sequenced genome will produce tags that map further apart than expected in the reference genome, because the reference genome contains a segment of DNA that is not present in the sequenced genome.
#ChIP-PET: The combined use of chromatin immunoprecipitation (ChIP) and PET is used to detect regions of DNA bound by a protein of interest. ChIP-PET has the advantage over single-read sequencing of reducing the ambiguity of the reads generated. Its advantage over chip hybridization (ChIP-chip) is that hybridization tiling arrays do not have the statistical sensitivity that sequence reads have. However, ChIP-PET, ChIP-Seq and ChIP-chip have all been highly successful.
#ChIA-PET: The application of PET sequencing to chromatin interaction analysis. It is a genome-wide strategy for finding ''de novo'' long-range interactions between DNA elements bound by protein factors. [Fullwood MJ, Liu MH, Pan YF et al. 2009. An oestrogen-receptor-alpha-bound human chromatin interactome. Nature. 462: 58-64.] The first ChIA-PET was developed by Fullwood ''et al.'' (2009) to generate a map of the interactions between chromatin bound by oestrogen receptor α (ER-α) in oestrogen-treated human breast adenocarcinoma cells.
ChIA-PET is an unbiased way to analyze interactions and higher-order chromatin structures because it can detect interactions between unknown DNA elements. In contrast, 3C and 4C methods are used to detect interactions involving a specific target region in the genome. ChIA-PET is similar to finding fusion genes through RNA-PET in that the paired tags map to different regions in the genome. However, ChIA-PET involves artificial ligations between different DNA fragments located at different genomic regions, rather than naturally occurring fusion between two genomic regions as in RNA-PET.
#RNA-PET: This application is used for studying the transcriptome: transcripts, gene structures, and gene expression. [Ng P, Wei CL, Sung WK et al. 2005. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat. Methods. 2: 105–111.] The PET library is generated using full-length cDNAs, so the ditags represent the 5' capped and the 3' polyA tail signatures of individual transcripts. Therefore, RNA-PET is especially useful for demarcating the boundaries of transcription units. This helps identify alternative transcription start sites and polyadenylation sites of genes.
RNA-PET could also be used to detect fusion genes and trans-splicing, but further experiments are needed to distinguish between them. [Ruan Y, Ooi HS, Choo SW et al. 2007. Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysis using Paired-End diTags (PETs). Genome Res. 17: 828–838.] Other methods of finding the boundaries of transcripts include the single-tag strategies CAGE, SAGE, and the most recent SuperSAGE, with CAGE and 5' SAGE defining the transcription start sites and 3' SAGE defining the polyadenylation sites.
The advantages of PET sequencing over these methods are that PETs identify both ends of the transcripts and, at the same time, provide more specificity when mapping back to the genome. Sequencing the cDNAs can reveal the structures of transcripts in great detail, but this approach is much more expensive than RNA-PET sequencing, especially for characterizing the whole transcriptome.
The major limitation of RNA-PET is the lack of information regarding the organization of the internal exons of transcripts. Therefore, RNA-PET is not suitable for detecting alternative splicing. In addition, if the cloning procedure is used to construct the cDNA library before generating the PETs, cDNAs that are difficult to clone (as a result of long transcripts) would have lower coverage.
Similarly, transcripts (or transcript isoforms) with low expression levels would likely be under-represented as well.
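The distance check described above for DNA-PET, in which tags that map further apart than the selected fragment size suggest a deletion, can be sketched as a simple classifier. Only the deletion rule comes from the text; the rules for insertions, inversions, and translocations are commonly used conventions added here as assumptions, and the names `MappedPet` and `classify_pet`, the orientation flag, and the tolerance parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedPet:
    chrom5: str                        # chromosome where the 5' tag maps
    pos5: int                          # reference position of the 5' tag
    chrom3: str                        # chromosome where the 3' tag maps
    pos3: int                          # reference position of the 3' tag
    expected_orientation: bool = True  # False if tag strands are unexpected


def classify_pet(pet: MappedPet, expected_span: int, tolerance: int) -> str:
    """Label one mapped PET as concordant or as a candidate variant.

    `expected_span` is the fragment size selected during library
    construction; `tolerance` is an assumed allowance for size variation.
    """
    if pet.chrom5 != pet.chrom3:
        return "possible translocation"   # assumption, not from the text
    if not pet.expected_orientation:
        return "possible inversion"       # assumption, not from the text
    span = abs(pet.pos3 - pet.pos5)
    if span > expected_span + tolerance:
        # Tags map further apart on the reference than the fragment size:
        # the reference holds DNA that the sequenced genome lacks.
        return "possible deletion"
    if span < expected_span - tolerance:
        return "possible insertion"       # assumption, not from the text
    return "concordant"


if __name__ == "__main__":
    pets = [
        MappedPet("chr1", 10_000, "chr1", 11_050),
        MappedPet("chr1", 10_000, "chr1", 16_000),
        MappedPet("chr1", 10_000, "chr2", 11_000),
    ]
    for pet in pets:
        print(pet, "->", classify_pet(pet, expected_span=1_000, tolerance=200))
```

In practice a single discordant PET is weak evidence, so calls of this kind are normally made only where several independent PETs support the same discrepancy.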