The Cancer Genome Anatomy Project (CGAP), created by the National Cancer Institute (NCI) in 1997 and introduced by Al Gore, is an online database on normal, pre-cancerous and cancerous genomes. It also provides tools for viewing and analysing the data, allowing identification of genes involved in various aspects of tumor progression. The goal of CGAP is to characterize cancer at a molecular level by providing a platform with readily accessible, up-to-date data and a set of tools so that researchers can easily relate their findings to existing knowledge. There is also a focus on developing software tools that improve the usability of large and complex datasets. The project is directed by Daniela S. Gerhard and includes several sub-projects or initiatives, notably the Cancer Chromosome Aberration Project (CCAP) and the Genetic Annotation Initiative (GAI). CGAP contributes to many databases, and organisations such as the NCBI contribute to CGAP's databases. The intended outcomes of CGAP include establishing a correlation between a particular cancer's progression and its therapeutic outcome, improved evaluation of treatment, and development of novel techniques for prevention, detection and treatment. This is achieved by characterising the mRNA products of biological tissue.

Research

Background

The fundamental cause of cancer is the inability of a cell to regulate its gene expression. To characterise a specific type of cancer, one can examine either the proteins produced from the altered gene expression or the mRNA precursors of those proteins. CGAP works to associate a particular cell's expression profile, molecular signature or transcriptome, which is essentially the cell's fingerprint, with the cell's phenotype. Expression profiles are therefore recorded with respect to cancer type and stage of progression.

Sequencing

CGAP's initial goal was to establish a Tumor Gene Index (TGI) to store the expression profiles, contributing to both new and existing databases. This work produced two types of libraries, EST libraries and later SAGE libraries. Library construction was performed in a series of steps:
* Cell contents are washed over plates coated with poly-T sequences. These bind the poly-A tails that exist only on mRNA molecules, selectively retaining mRNA.
* The isolated mRNA is processed into a cDNA transcript through reverse transcription and DNA polymerisation reactions.
* The resulting double-stranded DNA is then incorporated into E. coli plasmids. Each bacterium now contains one unique cDNA and is replicated to produce clones with the same genetic information. This is termed a cDNA library.
* The library can then be sequenced by high-throughput sequencing techniques. This can characterise both the different genes expressed by the original cell and the amount of expression of each gene.

The TGI focused on prostate, breast, ovarian, lung and colon cancers at first, and CGAP later extended its research to other cancers. Practical issues arose, which CGAP addressed as new technologies became available. Many cancers occur in tissues with multiple cell types. Traditional techniques took the whole tissue sample and produced bulk-tissue cDNA libraries, and this cellular heterogeneity made the gene expression information less accurate in terms of cancer biology. An example is prostate cancer tissue, where epithelial cells, which have been shown to be the only cell type to give rise to cancer, make up only 10% of the cell count. This led to the development of laser capture microdissection (LCM), a technique that can isolate individual cell types or individual cells, which gave rise to cDNA libraries of specific cell types.

Sequencing a cDNA yields the entire mRNA transcript that generated it. In practice, only part of the sequence is required to uniquely identify the associated mRNA or protein. This part of the sequence is termed the expressed sequence tag (EST) and is always at the end of the sequence, close to the poly-A tail. EST data are stored in a database called dbEST. ESTs need only be around 400 bases long, but even at this length NGS techniques produce low-quality reads.

Therefore, an improved method called serial analysis of gene expression (SAGE) is also used. For each cDNA transcript molecule produced from a cell's gene expression, this method identifies a region only 10-14 bases long, anywhere along the read sequence, that is sufficient to uniquely identify that transcript. These short regions are cut out and linked together, then incorporated into bacterial plasmids as described above. SAGE libraries have better read quality and generate a larger amount of data when sequenced. Because transcripts are compared in absolute rather than relative levels, SAGE also has the advantage of requiring no normalisation of data via comparison with a reference.
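The tag-counting idea behind SAGE can be illustrated with a minimal Python sketch. It assumes a simplified model that is not taken from CGAP: every tag has the same fixed length (10 bases here, the low end of the 10-14 range above) and tags are joined end to end with no linker or spacer sequence. The function names and example reads are hypothetical.

```python
from collections import Counter

TAG_LENGTH = 10  # assumed fixed tag length; the text gives a range of 10-14 bases


def split_concatemer(read, tag_length=TAG_LENGTH):
    """Split a read made of concatenated tags into consecutive fixed-length tags."""
    usable = len(read) - len(read) % tag_length  # ignore an incomplete trailing tag
    return [read[i:i + tag_length] for i in range(0, usable, tag_length)]


def count_tags(reads, tag_length=TAG_LENGTH):
    """Count how often each tag occurs across all reads of a library."""
    counts = Counter()
    for read in reads:
        counts.update(split_concatemer(read.upper(), tag_length))
    return counts


# Hypothetical reads: each is a chain of 10-base tags (the second has 2 stray bases).
reads = [
    "AAACCCGGGT" "TTTGGGCCCA" "AAACCCGGGT",
    "AAACCCGGGT" "ACGTACGTAC" "GG",
]
tag_counts = count_tags(reads)
for tag, count in tag_counts.most_common():
    print(tag, count)
```

Under these assumptions, the count for each tag is a direct tally of how many transcript molecules carried it, which is why the counts can be read as absolute expression levels.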

Resources

Following sequencing and establishment of libraries, CGAP incorporates the data along with existing data sources and provides various databases and tools for analysis. A detailed description of tools and databases created or used by CGAP can be found on NCI's CGAP website. Below are some of the initiatives or research tools provided by CGAP.

Genetic Annotation Initiative

The goal of the Cancer Genome Anatomy Project Genetic Annotation Initiative (CGAP-GAI) is to discover and catalogue single nucleotide polymorphisms (SNPs) that correlate with cancer initiation and progression. CGAP-GAI has created a variety of tools for the discovery, analysis and display of SNPs. SNPs are valuable in cancer research because they can be used in several kinds of genetic study, commonly to track transmission, identify alternate forms of genes and analyze complex molecular pathways that regulate cell metabolism, growth, or differentiation.

SNPs in the CGAP-GAI are found either by resequencing genes of interest in different individuals or by searching existing human EST databases and making comparisons. The initiative examines transcripts from healthy individuals, individuals with disease, tumour tissue and cell lines from a large set of individuals; the database is therefore more likely to include rare disease mutations in addition to high-frequency variants. A common challenge in SNP detection is distinguishing sequencing errors from actual polymorphisms. Candidate SNPs undergo statistical analysis using the CGAP SNP pipeline to calculate the probability that the variant is in fact a polymorphism. High-probability SNPs are validated, and tools are available that predict whether function is altered.

To make the data easily accessible, CGAP-GAI has a number of tools that can display both a sequence alignment and an assembly overview, in the context of the sequences from which the SNPs were predicted. SNPs are annotated, and integrated genetic/physical maps are often determined.
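As an illustration of how a variant can be scored as a polymorphism rather than a sequencing error, the following Python sketch applies Bayes' rule to two competing binomial models. This is not the CGAP SNP pipeline, whose details are not given here; the error rate, allele frequency, prior and function names are all assumed values chosen for the example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def snp_posterior(alt_reads, total_reads, error_rate=0.01,
                  allele_freq=0.5, prior=0.001):
    """Posterior probability that a site is polymorphic (illustrative model only).

    Assumed model: under 'error', each read shows the alternate base with
    probability error_rate; under 'SNP', with probability allele_freq.
    """
    like_error = binomial_pmf(alt_reads, total_reads, error_rate)
    like_snp = binomial_pmf(alt_reads, total_reads, allele_freq)
    evidence = prior * like_snp + (1 - prior) * like_error
    return prior * like_snp / evidence


# A variant seen in 5 of 10 reads versus one seen in only 1 of 10 reads.
print(snp_posterior(5, 10))
print(snp_posterior(1, 10))
```

With these assumed parameters, a variant supported by half of the reads scores close to 1, while a variant seen in a single read is far more plausibly an error.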

Cancer Chromosome Aberration Project (CCAP)

Genomic instability is a common feature of cancer; understanding structural and chromosomal abnormalities can therefore give insight into the progression of disease. The Cancer Chromosome Aberration Project (CCAP) is a CGAP-supported initiative for defining chromosome structure and characterizing rearrangements that are associated with malignant transformation. It incorporates the online version of the Mitelman database, a compilation of known chromosomal rearrangements created by Felix Mitelman, Bertil Johansson and Fredrik Mertens before the creation of CGAP. The CCAP has several goals:
* Integrating cytogenetic and physical maps of the human genome
* Generating a repository of BAC clones across the genome that are genetically and physically mapped
* Developing a platform for parallel database correlation of cancer-associated aberrations (a fluorescent in-situ hybridization (FISH)-mapped BAC clone database)
* Integrating three cytogenetic analysis techniques (spectral karyotyping, comparative genomic hybridization, and FISH) to refine the nomenclature used to define karyotypic aberrations

The database contains cytogenetic information from over 64,000 patient cases, including more than 2000 gene fusions. As part of this project there is a repository of physically and cytogenetically mapped BAC clones for the human genome, which are physically available through a network of distributors. The CCAP clones have been mapped cytogenetically using FISH at a resolution of 1-2 Mb across the human genome, and physically mapped using sequence-tagged sites (STS). The data for BAC clones are also available through CGAP and NCBI databases.

Other Resources

Listed below are some other resources available through CGAP.

Digital Differential Display

An early technique used by CGAP is digital differential display (DDD), which uses the Fisher exact test to compare libraries against each other in order to find significant differences between populations. CGAP ensured that DDD was able to compare all cDNA libraries in dbEST, and not just those generated by CGAP.
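The Fisher exact test mentioned above can be sketched in a few lines of Python. This is a generic two-sided test on a 2x2 table, not CGAP's DDD implementation; the framing of the table (sequences for one gene versus all other sequences, in each of two libraries) and the function name are assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout: a = sequences for the gene in library 1, b = all other
    sequences in library 1; c and d are the same counts for library 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Hypothetical counts: the gene accounts for 8 of 10 sequences in one
# library but only 1 of 6 in the other.
print(fisher_exact_two_sided(8, 2, 1, 5))
```

A small p-value suggests the gene's share of sequences differs between the two libraries by more than chance would explain. Real libraries are far larger, and a comparison across many genes would also need a multiple-testing correction, which this sketch omits.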

Mammalian Gene Collection (MGC)

The MGC provides researchers with full-length protein-coding information from cDNA, unlike EST or SAGE databases, which provide only the identifying tag. The project includes human and mouse genes; cow cDNAs generated by Genome Canada were added later.

SAGEmap

SAGEmap is the database used to store SAGE libraries. Over 3.4 million SAGE tags existed as of 2001. Tools can be used to map SAGE tags to UniGene clusters (UniGene being a database that stores transcriptomes), which makes it easier to identify a SAGE tag's corresponding sequence. In addition, several tools are associated with SAGEmap:
* Digital Northern is used to measure the expression level of specific genes,
* SAGE Anatomic Viewer displays this information visually, and compares it between normal and cancerous cells,
* Ludwig Transcript (LT) Viewer shows alternative transcripts and their possible associated SAGE tags,
* Expression Matrix shows gene expression levels throughout mouse development for different tissue types.
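Mapping tags to clusters and then comparing expression between samples can be sketched as follows. The tag sequences, cluster names and counts are hypothetical placeholders rather than real SAGEmap or UniGene records, and the lookup table stands in for whatever mapping the real tools use.

```python
from collections import defaultdict

# Hypothetical tag-to-cluster table; real mappings would come from SAGEmap.
TAG_TO_CLUSTER = {
    "AAACCCGGGT": "cluster_A",
    "TTTGGGCCCA": "cluster_A",  # two tags may map to the same cluster
    "ACGTACGTAC": "cluster_B",
}


def counts_by_cluster(tag_counts, tag_to_cluster):
    """Sum tag counts per cluster and report how many tags could not be mapped."""
    totals = defaultdict(int)
    unmapped = 0
    for tag, count in tag_counts.items():
        cluster = tag_to_cluster.get(tag)
        if cluster is None:
            unmapped += count
        else:
            totals[cluster] += count
    return dict(totals), unmapped


# Hypothetical tag counts from a normal and a tumour library.
normal = {"AAACCCGGGT": 4, "ACGTACGTAC": 2, "GGGGGGGGGG": 1}
tumour = {"AAACCCGGGT": 1, "TTTGGGCCCA": 1, "ACGTACGTAC": 9}

normal_totals, normal_unmapped = counts_by_cluster(normal, TAG_TO_CLUSTER)
tumour_totals, tumour_unmapped = counts_by_cluster(tumour, TAG_TO_CLUSTER)
for cluster in sorted(set(normal_totals) | set(tumour_totals)):
    print(cluster, normal_totals.get(cluster, 0), tumour_totals.get(cluster, 0))
```

Keeping a separate count of unmapped tags makes it visible how much of a library falls outside the lookup table instead of silently dropping it.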

Gene Finder

The CGAP Gene Finder locates a gene or a list of genes based on specified search criteria and provides links to different NCI and NCBI databases. A gene can be searched for specifically, using a unique identifier such as a gene symbol or Entrez gene number, or more generally by function, tissue or keyword. Other gene tools accessible through the CGAP web interface include the Gene Ontology Browser (GO) and the Nucleotide BLAST tool.

Gene Expression Tools

cDNA xProfiler and cDNA Digital Gene Expression Displayer (DGED) are used together to find genes of interest whose expression differs to a statistically significant degree between two pools of cDNA libraries; typically the comparison is between normal and cancer tissues. DGED determines statistical significance using a combination of Bayesian statistics and a sequence odds ratio to calculate a probability. cDNA DGED relies on the UniGene relational database, while cDNA xProfiler uses a flat file database that is not available online.
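A sequence odds ratio between two library pools can be computed as in the sketch below. This is a generic odds-ratio calculation, not DGED's actual implementation, and it leaves out the Bayesian step entirely. The pseudocount of 0.5 (a common continuity correction that avoids division by zero) and the function name are assumptions.

```python
def odds_ratio(hits_a, total_a, hits_b, total_b, pseudocount=0.5):
    """Odds of a sequence belonging to a gene in pool A relative to pool B.

    hits_* are sequences assigned to the gene; total_* are all sequences in
    the pool. The pseudocount is an assumed correction, added to every cell
    so that a gene absent from one pool still yields a finite ratio.
    """
    a = hits_a + pseudocount
    b = (total_a - hits_a) + pseudocount
    c = hits_b + pseudocount
    d = (total_b - hits_b) + pseudocount
    return (a * d) / (b * c)


# Hypothetical counts: 30 of 1000 sequences in the cancer pool versus
# 10 of 1000 in the normal pool.
print(odds_ratio(30, 1000, 10, 1000))
```

A ratio above 1 means the gene is relatively more abundant in pool A, and a ratio below 1 means it is more abundant in pool B. The ratio alone says nothing about significance, which is why it is paired with a probability calculation in the tool described above.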

Outcomes and Future

CGAP is now a centralised location for several genomics tools and genetic databases and is employed widely in cancer and molecular biology research. The databases established by CGAP continue to contribute to knowledge of cancers in terms of their pathways and progression. The transcriptome databases can also be used in non-cancer research, as they contain information that can be used to quickly and easily identify particular sequenced genes. The data also have clinical impact, as cDNAs can be used to create microarrays for diagnosis and treatment comparison purposes. CGAP has been used in many studies, with examples including:
* Characterising differences in normal and cancerous endothelial cell gene expression
* Identifying irregular gene expression as markers for glioblastomas and ovarian cancer
* Identifying gene expression specific to prostate tissue
* Comparison of proteins expressed in normal and cancerous reproductive tissue

In addition, the vast amount of data generated by CGAP has prompted improvements in data analysis and mining techniques, with examples including:
* Comparison of gene expression from multiple cDNA libraries
* Improved techniques for mining EST libraries
* Integral, large-scale studies of human transcriptome analysis


External links

CGAP slide presentation

CGAP Catalog of resources