
Biological databases are libraries of biological data collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), and clinical effects of mutations, as well as similarities of biological sequences and structures.
Biological databases can be classified by the kind of data they collect (see below). Broadly, there are molecular databases (for sequences, molecules, etc.), functional databases (for physiology, enzyme activities, phenotypes, ecology, etc.), taxonomic databases (for species and other taxonomic ranks), databases of images and other media, and specimen databases (for museum collections, etc.).
Databases are important tools in assisting scientists to analyze and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species. This knowledge helps in the fight against diseases, in the development of medications, in predicting certain genetic diseases, and in discovering basic relationships among species in the history of life.
Technical basis and theoretical concepts
Relational database concepts of computer science and information retrieval concepts of digital libraries are important for understanding biological databases. Biological database design, development, and long-term management is a core area of the discipline of bioinformatics. Data contents include gene sequences, textual descriptions, attributes and ontology classifications, citations, and tabular data. These are often described as semi-structured data, and can be represented as tables, key-delimited records, and XML structures.
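As a rough illustration of the three representations just mentioned, the sketch below renders one record as a table row, a key-delimited record, and an XML structure. The record, its field names, the accession `DEMO0001`, and the `entry` element name are all invented for the example and do not follow the schema of any real database.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical record: field names and values are illustrative only.
record = {
    "accession": "DEMO0001",
    "organism": "Escherichia coli",
    "sequence": "ATGGCGTAA",
}


def as_table(rec):
    """Render the record as a tab-separated header row plus one data row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rec), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerow(rec)
    return buf.getvalue()


def as_key_delimited(rec):
    """Render the record as 'KEY: value' lines, one field per line."""
    return "".join(f"{key.upper()}: {value}\n" for key, value in rec.items())


def as_xml(rec):
    """Render the record as a flat XML element with one child per field."""
    root = ET.Element("entry")  # element name is an assumption
    for key, value in rec.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(as_table(record))
    print(as_key_delimited(record))
    print(as_xml(record))
```

The same three fields survive in every form; what changes is how much structure (column order, key labels, nesting) the format itself carries.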
Access
Most biological databases are available through web sites that organise data such that users can browse through the data online. In addition, the underlying data is usually available for download in a variety of formats.
Biological data comes in many formats. These formats include text, sequence data, protein structure and links. Each of these can be found from certain sources, for example:
* Text formats are provided by PubMed and OMIM.
* Sequence data is provided by GenBank, in terms of DNA, and UniProt, in terms of protein.
* Protein structures are provided by PDB, SCOP, and CATH.
Problems and challenges
Biological knowledge is distributed among countless databases. This sometimes makes it difficult to ensure the consistency of information, e.g. when different names are used for the same species or when data formats differ. As a consequence, interoperability is a constant challenge for information exchange. For instance, if a DNA sequence database stores the DNA sequence along with the name of a species, a name change of that species may break the links to other databases, which may use a different name.
Integrative bioinformatics is one field attempting to tackle this problem by providing unified access. One solution is for biological databases to cross-reference other databases by accession number to link their related knowledge together (e.g. so that the accession number stays the same even if a species name changes). Redundancy is another problem, as many databases must store the same information, e.g. protein structure databases also contain the sequences of the proteins they cover and their bibliographic information.
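The point about accession numbers can be sketched with a small in-memory example. Two tables in a single SQLite database stand in for two separate databases here; the table names, column names, and identifiers (`DEMO0001`, `STRUCT01`) are hypothetical. Because the structure record points at the sequence record by accession rather than by species name, renaming the species does not break the link.

```python
import sqlite3


def build_demo():
    """Create two toy 'databases' (tables) linked by an accession number."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sequence_db (
            accession    TEXT PRIMARY KEY,
            species_name TEXT NOT NULL,
            dna          TEXT NOT NULL
        );
        CREATE TABLE structure_db (
            structure_id       TEXT PRIMARY KEY,
            sequence_accession TEXT NOT NULL  -- cross-reference, not a name
        );
        INSERT INTO sequence_db VALUES ('DEMO0001', 'Old species name', 'ATGGCGTAA');
        INSERT INTO structure_db VALUES ('STRUCT01', 'DEMO0001');
        """
    )
    return conn


def linked_species(conn, structure_id):
    """Follow the cross-reference from a structure to its species name."""
    row = conn.execute(
        """
        SELECT s.species_name
        FROM structure_db AS x
        JOIN sequence_db AS s ON s.accession = x.sequence_accession
        WHERE x.structure_id = ?
        """,
        (structure_id,),
    ).fetchone()
    return row[0] if row else None


def rename_species(conn, accession, new_name):
    """Simulate a taxonomic name change in the sequence database."""
    conn.execute(
        "UPDATE sequence_db SET species_name = ? WHERE accession = ?",
        (new_name, accession),
    )


if __name__ == "__main__":
    conn = build_demo()
    print(linked_species(conn, "STRUCT01"))
    rename_species(conn, "DEMO0001", "New species name")
    print(linked_species(conn, "STRUCT01"))  # link still resolves
```

Had `structure_db` stored the species name as its link, the rename would have left it pointing at a name that no longer exists in `sequence_db`.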
Model-organism databases
Species-specific databases are available for some species, mainly those that are often used in research (''model organisms''). For example, EcoCyc is an ''E. coli'' database. Other popular model organism databases include Mouse Genome Informatics for the laboratory mouse, ''Mus musculus'', the Rat Genome Database for ''Rattus'', ZFIN for ''Danio rerio'' (zebrafish), PomBase for the fission yeast ''Schizosaccharomyces pombe'', FlyBase for ''Drosophila'', WormBase for the nematodes ''Caenorhabditis elegans'' and ''Caenorhabditis briggsae'', and Xenbase for ''Xenopus tropicalis'' and ''Xenopus laevis'' frogs.
Biodiversity and species databases

Numerous databases attempt to document the diversity of life on earth. A prominent example is the Catalogue of Life, first created in 2001 by Species 2000 and the Integrated Taxonomic Information System. The Catalogue of Life is a collaborative project that aims to document taxonomic categorization of all currently accepted species in the world. It provides a consolidated and consistent database for researchers and policymakers to reference, and curates up-to-date datasets from other sources such as the Conifer Database, ICTV MSL (for viruses), and LepIndex (for butterflies and moths). In total, the Catalogue of Life draws from 165 databases as of May 2022. Operational costs of the Catalogue of Life are paid for by the Global Biodiversity Information Facility, the Illinois Natural History Survey, the Naturalis Biodiversity Center, and the Smithsonian Institution.
Some biological databases also document the geographical distribution of different species. Shuang Dai et al. created a new multi-source database to document the spatial distribution of 1,371 bird species in China, as existing databases had been severely lacking in spatial distribution data for many species. Sources for this new database included books, literature, GPS tracking, and online webpage data. The new database displayed taxonomy, distribution, species information, and data sources for each species. After completion of the bird spatial distribution database, it was found that 61% of known species in China were distributed in regions beyond where they were previously known.
Medical databases

Medical databases are a special case of biomedical data resource and can range from bibliographies, such as PubMed, to image databases for the development of AI-based diagnostic software. For instance, one such image database was developed with the goal of aiding in the development of wound monitoring algorithms. Over 188 multi-modal image sets were curated from 79 patient visits, consisting of photographs, thermal images, and 3D mesh depth maps. Wound outlines were manually drawn and added to the photo datasets.
The database was made publicly available in the form of a program called WoundsDB, downloadable from the Chronic Wound Database website.
''Nucleic Acids Research'' Database Issue
An important resource for finding biological databases is a special yearly issue of the journal ''Nucleic Acids Research'' (NAR). The Database Issue of NAR is freely available, and categorizes many of the public biological databases. A companion database to the issue called the Online Molecular Biology Database Collection lists 1,380 online databases.
Other collections of databases exist such as MetaBase and the Bioinformatics Links Collection.
See also
* Biobank
* Biological data
* Chemical database
* Death Domain database
* European Bioinformatics Institute
* Gene Disease Database
* Integrative bioinformatics
* List of biological databases
* Model organism databases
* NCBI
* PubMed (a database of biomedical literature)
External links
* Interactive list of biological databases classified by categories, from ''Nucleic Acids Research'', 2010
* DBD: Database of Biological Databases
* Biosharing (a database of biological databases)
* Chronic Wounds Database (WoundsDB)
* Catalogue of Life