The European Nucleotide Archive (ENA) is a repository providing free and unrestricted access to annotated DNA and RNA sequences. It also stores complementary information such as experimental procedures, details of sequence assembly and other metadata related to sequencing projects. The archive is composed of three main databases: the Sequence Read Archive, the Trace Archive and the EMBL Nucleotide Sequence Database (also known as EMBL-Bank). The ENA is produced and maintained by the European Bioinformatics Institute and is a member of the International Nucleotide Sequence Database Collaboration (INSDC) along with the DNA Data Bank of Japan and GenBank. The ENA grew out of the EMBL Data Library, which was first released in 1982 as the first internationally supported resource for nucleotide sequence data. As of early 2012, the ENA and other INSDC member databases each contained complete genomes of 5,682 organisms and sequence data for almost 700,000. Moreover, the volume of data is increasing exponentially with a doubling time of approximately 10 months.

History

The European Nucleotide Archive originated from separate databases, the earliest of which was the EMBL Data Library, established in October 1980 at the European Molecular Biology Laboratory (EMBL), Heidelberg. The first release of this database was made in April 1982 and contained a total of 568 separate entries consisting of around 500,000 base pairs. In 1984, referring to the EMBL Data Library, Kneale and Kennard remarked that "it was clear some years ago that a large computerized database of sequences would be essential for research in Molecular Biology".

Despite the primary distribution method at the time being magnetic tape, by 1987 the EMBL Data Library was being used by an estimated 10,000 scientists internationally. The same year, the EMBL File Server was introduced to serve database records over BITNET, EARN and the early Internet. In May 1988 the journal ''Nucleic Acids Research'' introduced a policy stating that "manuscripts submitted to [''Nucleic Acids Research''] and containing or discussing sequence data must be accompanied by evidence that the data have been deposited with the EMBL Data Library."

During the 1990s the EMBL Data Library was renamed the EMBL Nucleotide Sequence Database and was formally relocated from Heidelberg to the European Bioinformatics Institute (EBI). In 2003, the Nucleotide Sequence Database was extended with the addition of the Sequence Version Archive (SVA), which maintains records of all current and previous entries in the database. A year later, in June 2004, limits on the maximum sequence length for each record (then 350 kilobases) were removed, allowing entire genome sequences to be stored as a single entry.

Following the uptake of Sanger sequencing, the Wellcome Trust Sanger Institute (then known as The Sanger Centre) began cataloguing sequence reads along with quality information in a database called The Trace Archive. The Trace Archive grew substantially with the commercialisation of high-throughput parallel sequencing technologies by companies such as Roche and Illumina.

In 2008, the EBI combined the Trace Archive, the EMBL Nucleotide Sequence Database (now also known as EMBL-Bank) and a newly developed Sequence (or Short) Read Archive (SRA) to make up the ENA, aimed at providing a comprehensive nucleotide sequence archive. As a member of the INSDC, the ENA exchanges data submissions each day with both the DNA Data Bank of Japan and GenBank.

EMBL Nucleotide Sequence Database

The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) is the section of the ENA which contains high-level genome assembly details, as well as assembled sequences and their functional annotation. EMBL-Bank is populated by direct submissions from genome consortia and smaller research groups as well as by the retrieval of sequence data associated with patent applications. As of release 114 (December 2012), the EMBL Nucleotide Sequence Database contained approximately 5×10¹¹ nucleotides with an uncompressed file size of 1.6 terabytes.

Data classes

The EMBL Nucleotide Sequence Database supports a variety of data derived from different sources including, but not limited to:
* Expressed sequence tags with their associated sample data.
* Nucleotide sequence being generated from whole genome sequencing projects at varying stages of assembly, including complete contigs and annotated, fully assembled sequence.
* Data relating to transcriptomics, such as complementary DNA, with optional annotation.
* Novel or extended annotations of existing coding sequences, for example new sequence versions with corrected start or stop codons.

EMBL-Bank format

The EMBL Nucleotide Sequence Database uses a plaintext flat-file format, typically referred to as EMBL-Bank format, to represent and store data. EMBL-Bank format uses a different syntax from the records in DDBJ and GenBank, though each format uses certain standardised nomenclature, such as taxonomies as defined by the NCBI Taxon database. Each line of an EMBL-format file begins with a two-letter code, such as AC to label the accession number and KW for a list of keywords relevant to the record; each record ends with //.
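The line-oriented layout described above lends itself to a very small reader. The following Python sketch is illustrative only: it assumes nothing beyond what the paragraph states (a two-letter code at the start of each line and a // terminator per record), and the function name and the sample accession and keyword values are made up for the example rather than taken from a real entry.

```python
from collections import defaultdict


def parse_embl_records(text):
    """Split EMBL-style flat-file text into records.

    Each record becomes a dict mapping a two-letter line code (for
    example "AC" or "KW") to the list of values seen under that code.
    A line starting with "//" closes the current record. Lines after
    the last "//" (an unterminated record) are ignored.
    """
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            records.append(dict(current))
            current = defaultdict(list)
            continue
        code, value = line[:2], line[2:].strip()
        current[code].append(value)
    return records


# Made-up sample text, not a real database entry.
sample = "AC   X00001;\nKW   example; demo.\n//\nAC   X00002;\n//\n"
records = parse_embl_records(sample)
print(records[0]["AC"])  # values recorded under the AC code of record 1
```

Grouping values by code keeps multi-line fields together while leaving the interpretation of each field to the caller.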

Sequence Read Archive

The ENA operates an instance of the Sequence Read Archive (SRA), an archival repository of sequence reads and analyses which are intended for public release. Originally called the Short Read Archive, the name was changed in anticipation of future sequencing technologies being able to produce longer sequence reads. Currently, the archive accepts sequence reads generated by next-generation sequencing platforms such as the Illumina Genome Analyzer and ABI SOLiD as well as some corresponding analyses and alignments. The SRA operates under the guidance of the International Nucleotide Sequence Database Collaboration (INSDC) and is the fastest-growing repository in the ENA. In 2010 the Sequence Read Archive made up approximately 95% of the base pair data available through the ENA, encompassing over 500,000,000,000 sequence reads made up of over 60 trillion (6×10¹³) base pairs. Almost half of this data was deposited in relation to the 1000 Genomes Project, wherein the researchers published their sequence data to the SRA in real time. In total, as of September 2010, 65% of the Sequence Read Archive was human genomic sequence, with another 16% relating to human metagenome sequence reads. The preferred data format for files submitted to the SRA is the BAM format, which is capable of storing both aligned and unaligned reads. Internally the SRA relies on the NCBI SRA Toolkit, used at all three INSDC member databases, to provide flexible data compression, API access and conversion to other formats such as FASTQ.
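To give a sense of what a read looks like after conversion to FASTQ, the Python sketch below reads four-line FASTQ records and decodes their quality strings. It is a minimal illustration, not part of the SRA Toolkit: the function names are hypothetical, it assumes the common four-line layout without wrapped sequence lines, and it assumes the Phred+33 quality encoding.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (header, bases, '+', qualities)."""
    stripped = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into per-base Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# Made-up read used purely for illustration.
example = ["@read1\n", "ACGT\n", "+\n", "!5II\n"]
for record in parse_fastq(example):
    print(record.name, record.sequence, phred_scores(record.quality))
```

Each quality character encodes one score, so the sequence and quality lines must have equal length; the parser rejects records where they do not.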

Data access

The data contained in the ENA can be accessed manually or programmatically via REST URL through the ENA browser. Initially limited to the Sequence Read Archive, the ENA browser now also provides access to the Trace Archive and EMBL-Bank, allowing file retrieval in a range of formats including XML, HTML, FASTA and FASTQ. Individual records can be accessed using their accession numbers and other text queries are enabled through the EB-eye search engine. Additionally, sequence similarity-based searches implemented using De Bruijn graphs offer another method of retrieving records from the ENA. The ENA is accessible via the EBI SOAP and REST APIs, which also offer access to other databases hosted at the EBI, such as Ensembl and InterPro.
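As a rough illustration of programmatic access by accession number, the Python sketch below builds a retrieval URL and fetches the record with the standard library. The base address, the path layout (format name followed by accession) and the set of format names are assumptions made for this example and are not taken from the text above; they should be checked against the current ENA documentation before use.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base address of the ENA browser's REST interface.
ENA_BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"
# Assumption: format names accepted as a path segment.
SUPPORTED_FORMATS = {"fasta", "embl", "xml"}


def record_url(accession, fmt="fasta", base=ENA_BROWSER_API):
    """Build the URL for one record, identified by its accession number."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    accession = accession.strip()
    if not accession:
        raise ValueError("accession must not be empty")
    # Percent-encode the accession so it is safe as a single path segment.
    return f"{base.rstrip('/')}/{fmt}/{quote(accession, safe='')}"


def fetch_record(accession, fmt="fasta", timeout=30.0):
    """Download one record as text (requires network access)."""
    with urlopen(record_url(accession, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access; the accession is only an example.
print(record_url("X56734", "embl"))
```

Keeping URL construction separate from the download makes the addressing scheme easy to inspect and to adjust if the service layout differs from the assumptions made here.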

Storage

The European Nucleotide Archive handles large volumes of data which pose a significant storage challenge. As of 2012, the ENA's storage requirements continue to grow exponentially, with a doubling time of approximately 10 months. To manage this increase, the ENA selectively discards less-valuable sequencing platform data and implements advanced compression strategies. The CRAM reference-based compression toolkit was developed to help reduce ENA storage requirements.
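The idea behind reference-based compression can be shown with a toy Python example: rather than storing every base of an aligned read, store where the read sits on a reference sequence and only the bases that differ. This is a deliberately simplified sketch with hypothetical function names, handling substitutions only; it does not reproduce the actual CRAM encoding.

```python
def encode_read(reference, read, position):
    """Encode a read as (position, length, mismatches) against a reference.

    mismatches is a list of (offset_within_read, base) pairs for the
    positions where the read differs from the reference.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends beyond the end of the reference")
    mismatches = [
        (offset, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]
    return position, len(read), mismatches


def decode_read(reference, encoded):
    """Rebuild the original read from its reference-based encoding."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTAC"
read = "GTACCT"  # differs from the reference at one position
encoded = encode_read(reference, read, 2)
print(encoded)
print(decode_read(reference, encoded) == read)
```

A read that matches the reference closely reduces to a position, a length and a short list of differences, which is why the approach pays off for resequencing data where most reads agree with a known genome.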

Funding

Currently the ENA is funded jointly by the European Molecular Biology Laboratory, the European Commission and the Wellcome Trust. The emerging ELIXIR framework, coordinated by EBI director Janet Thornton, aims to secure a sustainable European funding infrastructure to support the continued availability of life science databases such as the ENA.
