RefSeq

The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences (DNA, RNA) and their protein products. RefSeq was first introduced in 2000. The database is built by the National Center for Biotechnology Information (NCBI) and, unlike GenBank, provides only a single record for each natural biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from viruses to bacteria to eukaryotes. For each model organism, ''RefSeq'' aims to provide separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts. ''RefSeq'' is limited to major organisms for which sufficient data are available (121,461 distinct "named" organisms as of July 2022), while GenBank includes sequences for any organism submitted (approximately 504,000 formally described species).


RefSeq categories

The ''RefSeq'' collection comprises different data types with different origins, so standard categories and identifiers are needed to store each data type. For the most important categories, further details and additional categories, see Table 1 in Chapter 18 of the book ''The Reference Sequence (RefSeq) Database''.
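
As an illustration of how such categories work in practice, RefSeq accessions carry a two-letter prefix followed by an underscore, and the prefix indicates the molecule type. The Python sketch below maps a handful of widely documented prefixes to short descriptions. The descriptions are paraphrased, the table is not exhaustive, and the names `RefSeqCategory`, `REFSEQ_PREFIXES` and `classify` are hypothetical helpers introduced here for illustration, not part of any NCBI tool.

```python
import re
from typing import NamedTuple, Optional


class RefSeqCategory(NamedTuple):
    molecule: str      # broad molecule type
    description: str   # paraphrased, informal description


# Illustrative subset of RefSeq accession prefixes (assumption: paraphrased
# from NCBI's public documentation; not a complete list).
REFSEQ_PREFIXES = {
    "NC": RefSeqCategory("genomic", "complete genomic molecule"),
    "NG": RefSeqCategory("genomic", "incomplete genomic region"),
    "NT": RefSeqCategory("genomic", "contig or scaffold"),
    "NW": RefSeqCategory("genomic", "contig or scaffold, mainly WGS"),
    "NM": RefSeqCategory("mRNA", "protein-coding transcript"),
    "NR": RefSeqCategory("RNA", "non-protein-coding transcript"),
    "NP": RefSeqCategory("protein", "protein product"),
    "XM": RefSeqCategory("mRNA", "predicted protein-coding transcript model"),
    "XR": RefSeqCategory("RNA", "predicted non-coding transcript model"),
    "XP": RefSeqCategory("protein", "predicted protein model"),
    "WP": RefSeqCategory("protein", "non-redundant prokaryotic protein"),
}

# Two-letter prefix, underscore, identifier, optional ".version" suffix.
_ACCESSION_RE = re.compile(r"^([A-Z]{2})_([A-Z0-9]+)(?:\.(\d+))?$")


def classify(accession: str) -> Optional[RefSeqCategory]:
    """Return the category for a RefSeq-style accession, or None if unknown."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None  # not in PREFIX_IDENTIFIER[.VERSION] form
    return REFSEQ_PREFIXES.get(match.group(1))


if __name__ == "__main__":
    for acc in ("NM_000546.6", "NC_000001.11", "XP_011530000.1", "U12345"):
        print(acc, classify(acc))
```

Under these assumptions, `classify("NM_000546.6").molecule` is `"mRNA"`, while a string without the underscore-delimited prefix, such as `"U12345"`, yields `None`. A lookup table keyed on the prefix keeps the sketch easy to extend with prefixes it omits.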


RefSeq Projects

Several projects to improve ''RefSeq'' services are currently in development by the NCBI, often in collaboration with research centers such as EMBL-EBI:

* Consensus CDS (CCDS): This project aims to identify a core set of human and mouse protein-coding regions and to standardize sets of genes with high and consistent levels of genomic annotation quality. The project was announced in 2009 and is still in development.
* RefSeq Functional Elements (RefSeqFE): This project focuses on describing non-genic functional elements (gene regulatory regions such as enhancers, silencers, DNase I hypersensitive regions, DNA replication origins, etc.). Its current scope is restricted to the human and mouse genomes.
* RefSeqGene: Its main goal is to define genomic sequences to be used as reference standards for well-characterized genes. Previously described mRNA, protein and chromosome sequences have two weaknesses: they do not provide explicit genomic coordinates for gene flanking and intronic regions, and they show awkwardly large coordinates that change with every new genome assembly. The RefSeqGene project is designed to eliminate these problems.
* Targeted Loci: This project records molecular markers, especially protein-coding and ribosomal RNA loci, that are used for phylogenetic and barcoding analysis. Its scope includes sequences for Archaea, Bacteria and Fungi, accessible via Entrez and BLAST queries. It also includes GenBank sequences for Animals, Plants and Protists, accessible via BLAST queries.
* Virus Variation (ViV): A specialized resource of sequence data processing pipelines and analysis tools for the display and retrieval of sequences from several viral groups, such as influenza virus, ebolavirus, MERS coronavirus and Zika virus. New viruses, processing pipelines, tools and other features are added regularly.
* RefSeq Select: This project aims to designate a RefSeq Select transcript as the most representative one for every protein-coding gene, based on multiple criteria: prior use in clinical databases, transcript expression, evolutionary conservation of the coding region, etc. Many genes are represented by multiple ''RefSeq'' transcripts/proteins because of alternative splicing, and this complexity is problematic for studies such as comparative genomics or the exchange of clinical variant data.
* MANE (Matched Annotation from the NCBI and EMBL-EBI): A collaborative project between NCBI and EMBL-EBI whose main goal is to define a set of transcripts and their proteins for all the protein-coding genes in the human genome, thereby reducing the differences in transcript annotation between the ''RefSeq'' and Ensembl/GENCODE annotation systems. A MANE Select transcript set is created as a universal standard for clinical reporting and comparative or evolutionary genomics. A second set, MANE Plus Clinical, is also created with the additional transcripts needed to report all ''Pathogenic'' (P) or ''Likely Pathogenic'' (LP) clinical variants available in public resources. This project was announced in 2018 and is expected to finish in 2022.


Statistics

RefSeq release 213 (July 2022) reports the number of species represented in the database, counted by distinct taxonomic IDs, together with the counts of accessions and base pairs per molecule type.


See also

* GenBank
* Sequence analysis
* Sequence profiling tool
* Sequence motif
* UniProt
* List of sequenced eukaryotic genomes
* List of sequenced archaeal genomes


Sources

* ''The NCBI Handbook'', National Center for Biotechnology Information


External links


* RefSeq
* GenBank, RefSeq, TPA and UniProt: What's in a Name?