The Ensembl genome database project is a scientific project at the European Bioinformatics Institute, which provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates and model organisms. Ensembl is one of several well-known genome browsers for the retrieval of genomic information. Similar databases and browsers are found at NCBI and the University of California, Santa Cruz (UCSC).
History
The human genome consists of three billion base pairs, which code for approximately 20,000–25,000 genes. However, the genome alone is of little use unless the locations and relationships of individual genes can be identified. One option is manual annotation, whereby a team of scientists tries to locate genes using experimental data from scientific journals and public databases. However, this is a slow, painstaking task. The alternative, known as automated annotation, is to use the power of computers to do the complex pattern-matching of protein to DNA. The Ensembl project was launched in 1999 in response to the imminent completion of the Human Genome Project, with the initial goals of automatically annotating the human genome, integrating this annotation with available biological data and making all this knowledge publicly available.
In the Ensembl project, sequence data are fed into the gene annotation system (a collection of software "pipelines" written in Perl), which creates a set of predicted gene locations and saves them in a MySQL database for subsequent analysis and display. Ensembl makes these data freely accessible to the world research community. All the data and code produced by the Ensembl project are available to download, and there is also a publicly accessible database server allowing remote access. In addition, the Ensembl website provides computer-generated visual displays of much of the data.
Over time the project has expanded to include additional species (including key model organisms such as mouse, fruit fly and zebrafish) as well as a wider range of genomic data, including genetic variations and regulatory features. Since April 2009, a sister project, Ensembl Genomes, has extended the scope of Ensembl into invertebrate metazoa, plants, fungi, bacteria, and protists, focusing on providing taxonomic and evolutionary context to genes, whilst the original project continues to focus on vertebrates.
As of 2020, Ensembl supported over 50,000 genomes across both the Ensembl and Ensembl Genomes databases, and had added new features such as Rapid Release, a new website designed to make genome annotation data available more quickly to users, and a new website providing access to the SARS-CoV-2 reference genome.
Displaying genomic data
Central to the Ensembl concept is the ability to automatically generate graphical views of the alignment of genes and other genomic data against a reference genome. These are shown as data tracks, and individual tracks can be turned on and off, allowing the user to customise the display to suit their research interests. The interface also enables the user to zoom in to a region or move along the genome in either direction.
Other displays show data at varying levels of resolution, from whole karyotypes down to text-based representations of DNA and amino acid sequences, or present other types of display such as trees of similar genes (homologues) across a range of species. The graphics are complemented by tabular displays, and in many cases data can be exported directly from the page in a variety of standard file formats such as FASTA.
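As an illustration of working with exported data, the following is a minimal sketch of reading FASTA text in Python. It assumes the common FASTA convention (a header line starting with `>` followed by one or more sequence lines); the function name `parse_fasta` and the sample records are hypothetical and are not part of any Ensembl tool.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping from record identifier to its full sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the record currently being read.
            records[current_id] += line
    return records


# Hypothetical sample data, not real Ensembl output.
example = """>seq1 hypothetical description
ATGGCC
TTAA
>seq2
MKV
"""
sequences = parse_fasta(example.splitlines())
```

Sequences that wrap over several lines are joined into a single string per record, which is usually the form needed for downstream processing.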
Externally produced data can also be added to the display by uploading a suitable file in one of the supported formats, such as BAM, BED, or PSL.
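To make the idea of a "suitable file" concrete, here is a small sketch that writes features as six-column BED lines. It assumes the standard BED convention of tab-separated fields with a 0-based start and an exclusive end; the helper names and the feature shown are hypothetical, and the exact columns Ensembl accepts should be checked against its upload documentation.

```python
def bed_line(chrom, start, end, name=".", score=0, strand="."):
    """Format one BED6 line; start is 0-based and end is exclusive."""
    if start < 0 or end < start:
        raise ValueError("BED requires 0 <= start <= end")
    if strand not in {"+", "-", "."}:
        raise ValueError("strand must be '+', '-' or '.'")
    fields = (chrom, start, end, name, score, strand)
    return "\t".join(str(field) for field in fields)


def from_one_based(chrom, start, end, **kwargs):
    """Convert a 1-based inclusive interval to BED coordinates.

    Only the start shifts down by one; the inclusive end equals
    the exclusive BED end.
    """
    return bed_line(chrom, start - 1, end, **kwargs)


# Hypothetical feature: positions 1000..2000 (1-based, inclusive) on chromosome 1.
line = from_one_based("1", 1000, 2000, name="my_feature", strand="+")
```

The conversion helper is included because coordinates shown in a genome browser are typically 1-based and inclusive, whereas BED intervals are 0-based and half-open, a frequent source of off-by-one errors.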
Graphics are generated using a suite of custom Perl modules based on GD, the standard Perl graphics display library.
Alternative access methods
In addition to its website, Ensembl provides a REST API and a Perl API (Application Programming Interface) that models biological objects such as genes and proteins, allowing simple scripts to be written to retrieve data of interest. The same API is used internally by the web interface to display the data. It is divided into sections such as the core API, the compara API (for comparative genomics data), the variation API (for accessing SNPs, SNVs, CNVs, etc.), and the functional genomics API (to access regulatory data).
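As a rough sketch of scripted access through the REST API, the Python snippet below builds a request for the `lookup/id` endpoint of the public server at `https://rest.ensembl.org` and decodes the JSON reply. The helper names `lookup_url` and `fetch_json` are hypothetical, and the fields returned by the server should be checked against the REST documentation rather than assumed from this example.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

REST_SERVER = "https://rest.ensembl.org"


def lookup_url(stable_id, expand=False):
    """Build the URL for looking up an object by its stable identifier."""
    url = f"{REST_SERVER}/lookup/id/{quote(stable_id)}"
    if expand:
        # Ask the server to include child objects such as transcripts.
        url += "?" + urlencode({"expand": 1})
    return url


def fetch_json(url, timeout=30):
    """Send a GET request asking for JSON and return the decoded reply."""
    request = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires network access), using the stable ID of the human BRAF gene:
#     gene = fetch_json(lookup_url("ENSG00000157764"))
#     print(gene)
```

Separating URL construction from the network call keeps the first part easy to check offline; the request itself is left commented out because it depends on the live service.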
The Ensembl website provides extensive information on how to install and use the API. This software can be used to access the public MySQL database, avoiding the need to download enormous datasets. Users can even choose to retrieve data from the MySQL database with direct SQL queries, but this requires extensive knowledge of the current database schema.
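The kind of schema knowledge a direct query needs can be sketched as follows. To stay self-contained, the example does not connect to the public server: it uses Python's built-in `sqlite3` module with an in-memory stand-in containing a heavily simplified subset of two core-schema tables (`gene` and `seq_region`). The rows and stable IDs are invented, and the column subset is an assumption that should be checked against the schema documentation for the release in use.

```python
import sqlite3

# Join genes to the sequence region (e.g. chromosome) they lie on.
# sqlite3 uses "?" placeholders; MySQL drivers typically use "%s".
QUERY = """
SELECT g.stable_id, sr.name, g.seq_region_start, g.seq_region_end
FROM gene AS g
JOIN seq_region AS sr ON sr.seq_region_id = g.seq_region_id
WHERE g.biotype = ?
ORDER BY sr.name, g.seq_region_start
"""


def genes_by_biotype(connection, biotype):
    """Return (stable_id, region, start, end) rows for one biotype."""
    return connection.execute(QUERY, (biotype,)).fetchall()


# In-memory stand-in for a core database, with invented rows.
connection = sqlite3.connect(":memory:")
connection.executescript("""
CREATE TABLE seq_region (seq_region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,
    stable_id TEXT,
    biotype TEXT,
    seq_region_id INTEGER,
    seq_region_start INTEGER,
    seq_region_end INTEGER
);
INSERT INTO seq_region VALUES (1, '7'), (2, 'X');
INSERT INTO gene VALUES
    (1, 'GENE_A', 'protein_coding', 1, 5000, 9000),
    (2, 'GENE_B', 'lncRNA', 1, 100, 400),
    (3, 'GENE_C', 'protein_coding', 2, 250, 800);
""")
rows = genes_by_biotype(connection, "protein_coding")
```

Even this small query needs a join to turn an internal region identifier into a chromosome name, which hints at why the API is usually the more convenient route.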
Large datasets can be retrieved using the BioMart data-mining tool, which provides a web interface for downloading datasets using complex queries.
Lastly, there is an FTP server, which can be used to download entire MySQL databases as well as some selected data sets in other formats.
Current species
The annotated genomes include most fully sequenced vertebrates and selected model organisms. All of them are eukaryotes; there are no prokaryotes. As of 2022, 271 species were registered.
Open source/mirrors
All data that are part of the Ensembl project are open access and all software is open source, being freely available to the scientific community under a CC BY 4.0 license. Currently, the Ensembl website is mirrored at four different locations worldwide to improve the service.
See also
* List of sequenced eukaryotic genomes
* List of biological databases
* Sequence analysis
* Sequence profiling tool
* Sequence motif
* UCSC Genome Browser
* ENCODE
External links
* Vega
* Pre-Ensembl
* Ensembl genomes
* UCSC Genome Browser
* NCBI
* Ensembl: Browsing chordate genomes on EBI Train OnLine