The Vertebrate Genome Annotation (VEGA) database is a biological database dedicated to assisting researchers in locating specific areas of the genome and annotating genes or regions of vertebrate genomes. The VEGA browser is based on Ensembl web code and infrastructure and provides a public curation of known vertebrate genes for the scientific community. The VEGA website is updated frequently to maintain the most current information about vertebrate genomes and attempts to present consistently high-quality annotation of all its published vertebrate genomes or genome regions. VEGA was developed by the Wellcome Trust Sanger Institute and is in close association with other annotation databases, such as ZFIN (The Zebrafish Information Network), the Havana Group and GenBank. Manual annotation is currently more accurate at identifying splice variants, pseudogenes, polyadenylation features, non-coding regions and complex gene arrangements than automated methods.

History

The Vertebrate Genome Annotation (VEGA) database was first made public in 2004 by the Wellcome Trust Sanger Institute. It was designed to view manual annotations of human, mouse and zebrafish genomic sequences, and it is the central cache for genome sequencing centers to deposit their annotation of human chromosomes. Manual annotation of genomic data is extremely valuable to produce an accurate reference gene set but is expensive compared with automatic methods and so has been limited to model organisms. Annotation tools that have been developed at the Wellcome Trust Sanger Institute (WTSI) are now being used to fill that gap, as they can be used remotely and so open up viable community annotation collaborations. The HAVANA and VEGA Projects were run by Dr. Jennifer Harrow of the Wellcome Sanger Institute. VEGA has been archived since February 2017 and the HAVANA team moved to EMBL-EBI in June 2017.

Human genome

The Vega database is the central repository for the majority of genome sequencing centers to deposit their annotation of human chromosomes. Since the original VEGA publication, the number of human gene loci annotated has more than doubled to over 49,000 (September 2012 release), over 20,000 of which are predicted to be protein coding. The Havana Group as part of the consensus-coding sequence (CCDS) collaboration and whole-genome extension of the ENCODE project have fully manually annotated the human genome—which is available for reference, comparative analysis and sequence searches on the VEGA database. The final VEGA release was in February 2017 (release 68) and VEGA is now an archived site that will no longer be updated.

Other vertebrates

The VEGA database combines the information from individual vertebrate genome databases and brings them all together to allow easier access and comparative analysis for researchers. The human and vertebrate analysis and annotation (Havana) team at the Wellcome Trust Sanger Institute (WTSI) manually annotate the human, mouse and zebrafish genomes using the Otterlace/ZMap genome annotation tool. The Otterlace manual annotation system comprises a relational database that stores manual annotation data and supports the graphical interface, ZMap, and is based on the Ensembl schema.

Zebrafish

The zebrafish genome is being fully sequenced and manually annotated. It currently has 18,454 annotated VEGA genes, of which 16,588 are projected protein-coding genes (September 2012 release).

Mouse

The mouse genome currently has 23,322 annotated VEGA genes, of which 14,805 are projected protein-coding genes (June 2012 release). The loci chosen for manual annotation are spread throughout the genome, but some regions have received more focus than others: chromosomes 2, 4, 11 and X have been fully annotated. The annotation shown in this release of Vega is from a datafreeze taken on 19 March 2012, and the gene structures are presented in the merged mouse geneset shown in Ensembl release 67. Vega also shows artificial loci generated by the mouse knockout programs.

Pig

The pig genome currently has 2,842 annotated VEGA genes, of which 2,264 are projected protein-coding genes (September 2012 release). The pig major histocompatibility complex (MHC), also known as the swine leukocyte antigen complex (SLA), spans a 2.4Mb region of submetacentric chromosome 7 (SSC7p1.1-q1.1). Implicated in the control of immune response and susceptibility to a range of diseases, the pig MHC plays a unique role in histocompatibility. Chromosomes X-WTSI and Y-WTSI are currently being annotated by Havana.

Dog, chimpanzee, wallaby, and gorilla

The dog genome currently has 45 annotated VEGA genes, of which 29 are projected protein-coding genes (February 2005 release). The chimpanzee genome currently has 124 annotated VEGA genes, of which 52 are projected protein-coding genes (January 2012 release). The wallaby genome currently has 193 annotated VEGA genes, of which 76 are projected protein-coding genes (March 2009 release). The gorilla genome currently has 324 annotated VEGA genes, of which 176 are projected protein-coding genes (March 2009 release).
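The gene counts quoted in the sections above can be compared side by side by computing the share of annotated genes that are projected to be protein coding. The following is a minimal Python sketch of that arithmetic; the names `VEGA_GENE_COUNTS`, `coding_fraction` and `summarize` are illustrative and are not part of any VEGA or Ensembl tool. The human figures are left out because the text gives them only as lower bounds.

```python
# Gene counts transcribed from the text above:
# species -> (annotated VEGA genes, projected protein-coding genes).
# The dictionary and function names are hypothetical, for illustration only.
VEGA_GENE_COUNTS = {
    "zebrafish": (18454, 16588),
    "mouse": (23322, 14805),
    "pig": (2842, 2264),
    "dog": (45, 29),
    "chimpanzee": (124, 52),
    "wallaby": (193, 76),
    "gorilla": (324, 176),
}


def coding_fraction(annotated, coding):
    """Return the share of annotated genes projected to be protein coding."""
    if annotated <= 0:
        raise ValueError("annotated gene count must be positive")
    return coding / annotated


def summarize(counts):
    """Build (species, annotated, coding, percent) rows, largest gene set first."""
    rows = [
        (species, annotated, coding,
         round(100 * coding_fraction(annotated, coding), 1))
        for species, (annotated, coding) in counts.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for species, annotated, coding, percent in summarize(VEGA_GENE_COUNTS):
        print(f"{species:<11} {annotated:>6} {coding:>6} {percent:>5}%")
```

Because the releases differ by species (2005 to 2012), the percentages describe the state of each annotation effort rather than a like-for-like biological comparison.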

Comparative analysis

In addition to full genomes, and unlike other browsers, VEGA also displays small finished regions of interest from genomes of other vertebrates, human haplotypes and mouse strains. Currently this comprises the finished sequence and annotation of the major histocompatibility complex (MHC) from different human haplotypes, and from dog and pig (the latter of which is currently otherwise only available in very limited form in Ensembl Pre!). Additionally there is mouse NOD (non-obese diabetic) strain annotation of IDD (insulin-dependent diabetes) candidate regions and two more pig regions.

Vega contains comparative pairwise analysis between specific genomic regions from either different species or from different haplotypes or strains. This is in contrast to Ensembl, where many whole-genome versus whole-genome comparisons are performed. The analysis in Vega involves:

1. The identification of genomic alignments using LastZ.

2. Prediction of the orthologue pairs using the Ensembl gene tree pipeline. Note that although the pipeline generates phylogenetic gene trees, the limited scope of the Vega comparative analysis means that these will necessarily be incomplete, and consequently only orthologues are shown on the website.

3. The manual identification of alleles in either different human haplotypes or mouse strains.

There are five sets of analyses:

1. The MHC region has been compared between dog, pig (two assemblies), gorilla, chimpanzee, wallaby, mouse and eight human haplotypes:

* dog chromosome 12-MHC
* gorilla chromosome 6-MHC
* chimpanzee chromosome 6-MHC
* wallaby chromosome 2-MHC
* pig chromosome 7 on Sscrofa10.2 (24.7Mbp to 29.8Mbp)
* pig chromosome 7-MHC
* mouse chromosome 17 (33.3Mbp to 38.9Mbp)
* chromosome 6 on the human reference assembly (28Mbp to 34Mbp)
* chromosome 6 MHC region in the human COX, QBL, APD, DBB, MANN, MCF and SSTO haplotypes (full length chromosome fragments)

2. Comparisons between the LRC regions of pig, gorilla and human (nine haplotypes):

* pig chromosome 6 (53.6Mbp to 54.0Mbp)
* gorilla chromosome 19-LRC
* human chromosome 19q13.4 (54.6Mbp to 55.6Mbp) on the reference assembly
* chromosome 19 LRC region in the COX_1, COX_2, PGF_1, PGF_2, DM1A, DM1B, MC1A and MC1B haplotypes (full length chromosome fragments)

3. Insulin dependent diabetes (Idd) regions on six mouse chromosomes (1, 3, 4, 6, 11 and 17) have been compared between the C57BL/6 reference and one or more of the DIL Non-Obese Diabetic (NOD), CHORI-29 NOD, and the 129 strains. The regions of the C57BL/6 reference assembly used in these comparisons are:

* Idd3.1: chromosome 3, clones AC117584.11 to AC115749.12
* Idd4.1: chromosome 11, clones AL596185.12 to AL663042.5
* Idd4.2: chromosome 11, clones AL663082.5 to AL604065.7
* Idd4.2Q: chromosome 11, clones AL596111.7 to AL645695.18
* Idd5.1: chromosome 1, clones AL683804.15 to AL645534.20
* Idd5.3: chromosome 1, clones AC100180.12 to AC101699.9
* Idd5.4: chromosome 1, clones AC123760.9 to AC109283.8
* Idd6.1 + Idd6.2: chromosome 6, clones AC164704.4 to AC164090.3
* Idd6.3: chromosome 6, clones AC171002.2 to AC163356.2
* Idd9.1: chromosome 4, clones AL627093.17 to AL670959.8
* Idd9.1M: chromosome 4, clones AL611963.24 to AL669936.12
* Idd9.2: chromosome 4, clones CR788296.8 to AL626808.28
* Idd9.3: chromosome 4, clones AL607078.26 to AL606967.14
* Idd10.1: chromosome 3, clones AC167172.3 to AC131184.4
* Idd16.1: chromosome 17, clones AC125141.4 to AC167363.3
* Idd18.1: chromosome 3, clones AL845310.4 to AL683824.8
* Idd18.2: chromosome 3, clones AC123057.4 to AC129293.9

4. Comparisons between three specific regions:

* pig chromosome 17 (58.2Mbp to 67.4Mbp)
* human chromosome 20q13.13-q13.33 (45.8Mbp to 62.4Mbp)
* mouse chromosome 2 (168.3Mbp to 179.0Mbp)

5. Pairwise comparisons between three pairs of full length mouse and human chromosomes:

* human chromosome 1 and mouse chromosome 4
* human chromosome 17 and mouse chromosome 11
* human chromosome X and mouse chromosome X
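Several of the compared regions listed above are given as coordinate ranges, so their sizes can be worked out directly. The sketch below is an illustration only: it records the ranges quoted in the lists and computes each span in Mbp. The `Region` type, the analysis labels and `longest_per_analysis` are hypothetical names introduced here, not part of VEGA.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A compared region with coordinates in Mbp, as quoted in the lists above."""
    analysis: str      # illustrative label for the analysis set
    species: str
    chromosome: str
    start_mbp: float
    end_mbp: float

    @property
    def span_mbp(self) -> float:
        # Round to one decimal place, the precision of the quoted coordinates.
        return round(self.end_mbp - self.start_mbp, 1)


# Only regions with explicit coordinates in the text are included.
REGIONS = [
    Region("MHC", "pig", "7", 24.7, 29.8),
    Region("MHC", "mouse", "17", 33.3, 38.9),
    Region("MHC", "human", "6", 28.0, 34.0),
    Region("LRC", "pig", "6", 53.6, 54.0),
    Region("LRC", "human", "19", 54.6, 55.6),
    Region("three-region", "pig", "17", 58.2, 67.4),
    Region("three-region", "human", "20", 45.8, 62.4),
    Region("three-region", "mouse", "2", 168.3, 179.0),
]


def longest_per_analysis(regions):
    """Return the longest region in each analysis set, keyed by analysis label."""
    best = {}
    for region in regions:
        current = best.get(region.analysis)
        if current is None or region.span_mbp > current.span_mbp:
            best[region.analysis] = region
    return best


if __name__ == "__main__":
    for region in REGIONS:
        print(f"{region.analysis:<13} {region.species:<6} "
              f"chr{region.chromosome:<3} {region.span_mbp:>5} Mbp")
```

In each of the three sets the human region is the longest of those with quoted coordinates, for example 16.6 Mbp on human chromosome 20 against 9.2 Mbp on pig chromosome 17.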

External links

VEGA homepage

WTSI homepage

ENCODE homepage

ZFIN Homepage

Zebrafish Wild Type Strain Genome