The Sequence Read Archive (SRA, previously known as the Short Read Archive) is a bioinformatics database that provides a public repository for DNA sequencing data, especially the "short reads" generated by high-throughput sequencing, which are typically less than 1,000 base pairs in length. The archive is part of the International Nucleotide Sequence Database Collaboration (INSDC) and is run as a collaboration between the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).

The archive was established by the NCBI in 2007 to provide a repository for data produced by RNA-Seq and ChIP-Seq studies, as well as by large-scale studies including the Human Microbiome Project and the 1000 Genomes Project. Originally called the Short Read Archive, it was renamed in anticipation of future sequencing technologies being able to produce longer sequence reads.

The volume of data deposited in the Sequence Read Archive has grown rapidly. As of September 2010, 65% of the SRA was human genomic sequence, with another 16% relating to human metagenome sequence reads. Much of this data was deposited through the 1000 Genomes Project. In June 2011, the data contained within the SRA passed 100 terabases of DNA in volume.

The preferred data format for files submitted to the SRA is the BAM format, which is capable of storing both aligned and unaligned reads. Internally, the SRA relies on the NCBI SRA Toolkit, used at all three INSDC member databases, to provide flexible data compression, API access, and conversion to other formats such as FASTQ.

In February 2011, NCBI announced a plan to close the NCBI SRA because of a funding reduction. However, EBI and DDBJ announced that they would continue to support the SRA. In October 2011, NCBI announced continuation of funding for the SRA. Deposition of data in the SRA is mandated by most funding agencies and open access journals, and Nature Publishing Group journals require that DNA and RNA sequencing data be made available through the SRA.
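
To illustrate the FASTQ format named above as a conversion target, the following is a minimal Python sketch of reading FASTQ records once they have been exported from the archive. It uses only the standard library and is not part of the SRA Toolkit. The names `FastqRecord`, `read_fastq`, `phred_scores`, and `EXAMPLE` are hypothetical, the sample reads are made up for illustration, and the sketch assumes the common four-lines-per-record layout with Phred+33 quality encoding, which may not hold for every file.

```python
from io import StringIO
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: its name, its bases, and its per-base quality string."""
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record (assumed)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# Made-up sample data, not taken from the archive.
EXAMPLE = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!!#\n"

if __name__ == "__main__":
    for record in read_fastq(StringIO(EXAMPLE)):
        scores = phred_scores(record.quality)
        print(record.name, len(record.sequence), min(scores))
```

In practice the same generator could be pointed at a file handle opened on a FASTQ file produced by a conversion tool, for example to check how many reads fall under the 1,000 base pair length mentioned above.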


External links

European Nucleotide Archive page for searches in SRA
SRA homepage at NCBI
ERA submissions at EBI
at DDBJ
