Chemical databases

A chemical database is a database specifically designed to store chemical information. This information covers chemical and crystal structures, spectra, reactions and syntheses, and thermophysical data.


Types of chemical databases


Bioactivity database

Bioactivity databases correlate structures or other chemical information to bioactivity results taken from bioassays in literature, patents, and screening programs.


Chemical structures

Chemical structures are traditionally represented using lines indicating chemical bonds between atoms and drawn on paper (2D structural formulae). While these are ideal visual representations for the chemist, they are unsuitable for computational use, especially for search and storage. Small molecules (also called ligands in drug design applications) are usually represented using lists of atoms and their connections. Large molecules such as proteins are, however, more compactly represented using the sequences of their amino acid building blocks. Large chemical databases for structures are expected to handle the storage and searching of information on millions of molecules, taking up terabytes of physical memory.


Literature database

Chemical literature databases correlate structures or other chemical information to relevant references such as academic papers or patents. This type of database includes STN, SciFinder, and Reaxys. Links to literature are also included in many databases that focus on chemical characterization.


Crystallographic database

Crystallographic databases store X-ray crystal structure data. Common examples include the Protein Data Bank and the Cambridge Structural Database.


NMR spectra database

NMR spectra databases correlate chemical structure with NMR data. These databases often include other characterization data such as FTIR and mass spectrometry.


Reactions database

Most chemical databases store information on stable molecules, but reaction databases also store intermediates and temporarily created unstable molecules. Reaction databases contain information about products, educts (reactants), and reaction mechanisms.


Thermophysical database

Thermophysical data are information about
* phase equilibria, including vapor–liquid equilibrium, solubility of gases in liquids, liquids in solids (SLE), heats of mixing, vaporization, and fusion,
* caloric data like heat capacity, heat of formation and combustion,
* transport properties like viscosity and thermal conductivity.


Chemical structure representation

There are two principal techniques for representing chemical structures in digital databases:
* As connection tables / adjacency matrices / lists with additional information on bonds (edges) and atoms (nodes), such as MDL Molfile, PDB, and CML
* As a linear string notation based on depth-first or breadth-first traversal, such as SMILES/SMARTS, SLN, WLN, and InChI

These approaches have been refined to allow representation of stereochemical differences and charges as well as special kinds of bonding such as those seen in organometallic compounds. The principal advantage of a computer representation is the possibility for increased storage and fast, flexible search.
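
To make the two representations concrete, here is a minimal Python sketch. It assumes a toy connection table (heavy atoms only, no rings, charges, aromaticity, or stereochemistry) and derives a SMILES-like string by depth-first traversal. The names `adjacency` and `linear_notation` are illustrative only, and the output is not guaranteed to match what any real toolkit would write.

```python
# Toy connection table for ethanol (heavy atoms only): atoms are nodes, bonds are edges.
atoms = ["C", "C", "O"]            # atom index -> element symbol
bonds = [(0, 1, 1), (1, 2, 1)]     # (atom index, atom index, bond order)

BOND_SYMBOL = {1: "", 2: "=", 3: "#"}


def adjacency(atoms, bonds):
    """Build an adjacency list {atom index: [(neighbour index, bond order), ...]}."""
    adj = {i: [] for i in range(len(atoms))}
    for a, b, order in bonds:
        adj[a].append((b, order))
        adj[b].append((a, order))
    return adj


def linear_notation(atoms, bonds, start=0):
    """Depth-first traversal of an acyclic molecule into a SMILES-like string.

    Simplified on purpose: ring closures, aromaticity, charges and
    stereochemistry are not handled.
    """
    adj = adjacency(atoms, bonds)
    visited = set()

    def visit(i):
        visited.add(i)
        out = atoms[i]
        children = [(j, order) for j, order in adj[i] if j not in visited]
        for k, (j, order) in enumerate(children):
            piece = BOND_SYMBOL[order] + visit(j)
            # Every branch except the last one is wrapped in parentheses.
            out += piece if k == len(children) - 1 else "(" + piece + ")"
        return out

    return visit(start)


ethanol = linear_notation(atoms, bonds)
acetic_acid = linear_notation(["C", "C", "O", "O"],
                              [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
```

The connection table keeps every atom and bond addressable by index, which suits graph algorithms, while the linear string is compact and easy to store or index as text.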


Search


Substructure

Chemists can search databases using parts of structures, parts of their IUPAC names, or constraints on properties. Chemical databases differ from general-purpose databases chiefly in their support for substructure search. This kind of search is achieved by looking for subgraph isomorphism (sometimes also called a monomorphism) and is a widely studied application of graph theory. The algorithms for searching are computationally intensive, often of O(n³) or O(n⁴) time complexity (where n is the number of atoms involved). The intensive component of search is called atom-by-atom searching (ABAS), in which a mapping of the search substructure's atoms and bonds onto the target molecule is sought. ABAS usually makes use of the Ullmann algorithm or variations of it (e.g. SMSD).

Speedups are achieved by time amortization, that is, some of the time spent on search tasks is saved by using precomputed information. This precomputation typically involves creating bitstrings that represent the presence or absence of molecular fragments. By looking at the fragments present in a search structure, it is possible to eliminate the need for ABAS comparison with target molecules that do not possess those fragments. This elimination is called screening (not to be confused with the screening procedures used in drug discovery). The bitstrings used for these applications are also called structural keys. The performance of such keys depends on the choice of the fragments used for constructing the keys and the probability of their presence in the database molecules. Another kind of key makes use of hash codes based on fragments derived computationally. These are called 'fingerprints', although the term is sometimes used synonymously with structural keys. The amount of memory needed to store these structural keys and fingerprints can be reduced by 'folding', which is achieved by combining parts of the key using bitwise operations and thereby reducing the overall length.
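
The screen-then-match pipeline described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a production method: fragments are supplied as hypothetical strings rather than enumerated from the structure, the hash is CRC32 picked only because it is deterministic, and the atom-by-atom step is plain backtracking rather than the Ullmann refinement. All function names are invented for the example.

```python
import zlib


def fingerprint(fragments, n_bits=64):
    """Hash each fragment string to one bit; a Python int serves as the bitstring."""
    fp = 0
    for frag in fragments:
        fp |= 1 << (zlib.crc32(frag.encode("ascii")) % n_bits)
    return fp


def passes_screen(query_fp, target_fp):
    """True if every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp


def fold(fp, n_bits):
    """Halve the bitstring length by OR-ing the upper half onto the lower half."""
    half = n_bits // 2
    return (fp >> half) | (fp & ((1 << half) - 1))


def find_mapping(q_atoms, q_bonds, t_atoms, t_bonds):
    """Atom-by-atom search: map query atoms onto target atoms by backtracking.

    Returns {query index: target index} or None. Only element symbols and
    the presence of bonds are compared (bond orders are ignored).
    """
    t_edges = {frozenset(b) for b in t_bonds}
    q_nbrs = {i: set() for i in range(len(q_atoms))}
    for a, b in q_bonds:
        q_nbrs[a].add(b)
        q_nbrs[b].add(a)
    mapping = {}

    def extend(i):
        if i == len(q_atoms):
            return True
        for t in range(len(t_atoms)):
            if t in mapping.values() or t_atoms[t] != q_atoms[i]:
                continue
            # Every already-mapped query neighbour must be bonded to t in the target.
            if all(frozenset((mapping[n], t)) in t_edges
                   for n in q_nbrs[i] if n in mapping):
                mapping[i] = t
                if extend(i + 1):
                    return True
                del mapping[i]
        return False

    return dict(mapping) if extend(0) else None


# Target: ethanol heavy atoms; query: a C-O fragment.
ethanol_atoms, ethanol_bonds = ["C", "C", "O"], [(0, 1), (1, 2)]
query_atoms, query_bonds = ["C", "O"], [(0, 1)]

query_fp = fingerprint(["C-O"])            # hypothetical fragment labels
target_fp = fingerprint(["C-C", "C-O"])

mapping = None
if passes_screen(query_fp, target_fp):     # cheap bitwise test first
    mapping = find_mapping(query_atoms, query_bonds, ethanol_atoms, ethanol_bonds)
```

A target that fails the bitwise test can be discarded without running the expensive matcher. Folding keeps the screen safe (a true match still passes) but sets more bits per key, so more non-matching targets slip through to the atom-by-atom step.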


Conformation

Search by matching the 3D conformation of molecules or by specifying spatial constraints is another feature that is of particular use in drug design. Searches of this kind can be computationally very expensive. Many approximate methods have been proposed, including BCUTS, special function representations, moments of inertia, ray-tracing histograms, maximum distance histograms, and shape multipoles.
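
As a small illustration of searching by spatial constraints, the Python sketch below checks whether a stored conformer satisfies a set of inter-atomic distance ranges. The coordinates, atom labels, and the function name `satisfies` are made up for the example. Real 3D searches also have to deal with conformational flexibility and atom mapping, which this sketch ignores.

```python
import math


def satisfies(coords, constraints):
    """Check distance constraints on one conformer.

    coords:      {atom label: (x, y, z)}
    constraints: [(label_a, label_b, d_min, d_max)], same length unit as coords
    """
    return all(
        d_min <= math.dist(coords[a], coords[b]) <= d_max
        for a, b, d_min, d_max in constraints
    )


# Hypothetical conformer with two labelled atoms exactly 5 units apart.
conformer = {"donor": (0.0, 0.0, 0.0), "acceptor": (3.0, 4.0, 0.0)}

hit = satisfies(conformer, [("donor", "acceptor", 4.5, 5.5)])
miss = satisfies(conformer, [("donor", "acceptor", 1.0, 2.0)])
```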


Giga Search

Databases of synthesizable and virtual chemicals grow larger each year, so the ability to mine them efficiently is critical for drug discovery projects. MolSoft's MolCart Giga Search (http://www.molsoft.com/giga-search.html) is described as the first method designed for substructure search of billions of chemicals.


Descriptors

All properties of molecules beyond their structure can be split up into either physico-chemical or pharmacological attributes, also called descriptors. On top of that, there exist various artificial and more or less standardized naming systems for molecules that supply more or less ambiguous names and synonyms. The IUPAC name is usually a good choice for representing a molecule's structure in a string that is both human-readable and unique, although it becomes unwieldy for larger molecules. Trivial names, on the other hand, abound with homonyms and synonyms and are therefore a bad choice as a defining database key. While physico-chemical descriptors like molecular weight, (partial) charge, solubility, etc. can mostly be computed directly from the molecule's structure, pharmacological descriptors can be derived only indirectly, using involved multivariate statistics or experimental (screening, bioassay) results. All of these descriptors can be stored along with the molecule's representation to save computational effort, and usually are.
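
The idea of computing a descriptor once and storing it next to the structure can be sketched in Python like this. The record layout and the function name are assumptions for illustration, and the atomic masses are standard values rounded to three decimals.

```python
# Standard atomic masses, rounded to three decimals.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(composition):
    """Sum atomic masses over an element -> count mapping."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


# Hypothetical database record for ethanol (C2H6O).
record = {
    "name": "ethanol",
    "smiles": "CCO",
    "composition": {"C": 2, "H": 6, "O": 1},
}

# Compute the descriptor once and keep it with the record, so later
# property-range queries do not need to recompute it.
record["molecular_weight"] = round(molecular_weight(record["composition"]), 3)
```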


Similarity

There is no single definition of molecular similarity; however, the concept may be defined according to the application and is often described as an inverse of a measure of distance in descriptor space. Two molecules might be considered more similar, for instance, if their difference in molecular weight is lower than when compared with others. A variety of other measures could be combined to produce a multivariate distance measure. Distance measures are often classified into Euclidean and non-Euclidean measures depending on whether the triangle inequality holds. Maximum common subgraph (MCS) based substructure search (as a similarity or distance measure) is also very common. MCS is also used for screening drug-like compounds by finding molecules that share a common subgraph (substructure).

Chemicals in the databases may be clustered into groups of 'similar' molecules based on similarities. Both hierarchical and non-hierarchical clustering approaches can be applied to chemical entities with multiple attributes. These attributes or molecular properties may either be determined empirically or be computationally derived descriptors. One of the most popular clustering approaches is the Jarvis-Patrick algorithm.

In pharmacologically oriented chemical repositories, similarity is usually defined in terms of the biological effects of compounds (ADME/tox) that can in turn be semi-automatically inferred from similar combinations of physico-chemical descriptors using QSAR methods.
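
A minimal Python sketch of distance-based similarity and Jarvis-Patrick clustering follows. It assumes molecules are already reduced to numeric descriptor vectors (the values below are invented), that similarity is taken as 1 / (1 + distance), which is only one of many possible inversions, and that two items are merged when each is in the other's k-nearest-neighbour list and they share at least `k_min` neighbours.

```python
import math


def distance(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.dist(a, b)


def similarity(a, b):
    """One possible inverse of distance, scaled into the range (0, 1]."""
    return 1.0 / (1.0 + distance(a, b))


def jarvis_patrick(points, k, k_min):
    """Cluster point indices with the Jarvis-Patrick shared-neighbour criterion."""
    n = len(points)
    nbrs = []
    for i in range(n):
        # k nearest neighbours of point i, excluding i itself
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: distance(points[i], points[j]))
        nbrs.append(set(order[:k]))

    parent = list(range(n))          # simple union-find forest

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in nbrs[i] and i in nbrs[j]
            if mutual and len(nbrs[i] & nbrs[j]) >= k_min:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Invented one-dimensional descriptor values forming two obvious groups.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
clusters = jarvis_patrick(points, k=2, k_min=1)
```

In practice the descriptors would be scaled before computing distances, since a raw property such as molecular weight would otherwise dominate the smaller-valued ones.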


Registration systems

Database systems for maintaining unique records on chemical compounds are termed registration systems. These are often used for chemical indexing, patent systems and industrial databases. Registration systems usually enforce uniqueness of the chemical represented in the database through the use of unique representations. By applying rules of precedence for the generation of stringified notations, one can obtain unique ('canonical') string representations such as 'canonical SMILES'. Some registration systems, such as the CAS system, make use of algorithms to generate unique hash codes to achieve the same objective.

A key difference between a registration system and a simple chemical database is the ability to accurately represent that which is known, unknown, and partially known. For example, a chemical database might store a molecule with stereochemistry unspecified, whereas a chemical registry system requires the registrar to specify whether the stereo configuration is unknown, a specific (known) mixture, or racemic. Each of these would be considered a different record in a chemical registry system. Registration systems also preprocess molecules to avoid considering trivial differences such as differences in halogen ions in chemicals.

An example is the Chemical Abstracts Service (CAS) registration system. See also CAS registry number.
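
The uniqueness rule and the explicit stereo annotation can be sketched in Python as below. This is a toy model: it assumes the structure string has already been canonicalized by some external tool, and the class names, the identifier format, and the `Stereo` categories are invented for illustration, not taken from any real registration system.

```python
from enum import Enum


class Stereo(Enum):
    """What the registrar asserts about the stereo configuration."""
    UNKNOWN = "unknown"
    KNOWN_MIXTURE = "known mixture"
    RACEMIC = "racemic"


class Registry:
    """Toy registry: one identifier per (canonical structure, stereo status) pair."""

    def __init__(self):
        self._ids = {}

    def register(self, canonical_structure, stereo):
        # The structure string is assumed to be canonical already.
        key = (canonical_structure, stereo)
        if key not in self._ids:
            self._ids[key] = f"REG-{len(self._ids) + 1:06d}"
        return self._ids[key]


registry = Registry()
lactic_acid = "CC(O)C(=O)O"   # stereo centre left unspecified in the string

first = registry.register(lactic_acid, Stereo.RACEMIC)
again = registry.register(lactic_acid, Stereo.RACEMIC)    # same record
other = registry.register(lactic_acid, Stereo.UNKNOWN)    # different record
```

Making the stereo status part of the key is what lets the racemate and the sample of unknown configuration coexist as separate records instead of collapsing into one.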


List of Chemical Cartridges

* Accord
* Direct
* J Chem
* CambridgeSoft
* Bingo
* Pinpoint


List of Chemical Registration Systems

* ChemReg
* Register
* RegMol
* Compound-Registration
* Ensemble


Web-based


Tools

The computational representations are usually made transparent to chemists by graphical display of the data. Data entry is also simplified through the use of chemical structure editors. These editors internally convert the graphical data into computational representations.

There are also numerous algorithms for the interconversion of various formats of representation. An open-source utility for conversion is OpenBabel. These search and conversion algorithms are implemented either within the database system itself or, as is now the trend, as external components that fit into standard relational database systems. Both Oracle and PostgreSQL based systems make use of cartridge technology that allows user-defined datatypes. These allow the user to make SQL queries with chemical search conditions. For example, a query to search for records having a phenyl ring in their structure, represented as a SMILES string in a SMILESCOL column, could be SELECT * FROM CHEMTABLE WHERE SMILESCOL.CONTAINS('c1ccccc1').

Algorithms for the conversion of IUPAC names to structure representations and vice versa are also used for extracting structural information from text. However, there are difficulties due to the existence of multiple dialects of IUPAC. Work is under way to establish a unique IUPAC standard (see InChI).
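
The idea of extending SQL with a chemical search condition can be imitated in Python with the standard `sqlite3` module, which lets a Python function be registered and called from SQL. This is not an Oracle or PostgreSQL cartridge, and the function `contains_fragment` is a deliberately naive stand-in: it tests for a substring in the SMILES text, which is not a real substructure search and would miss the same ring written in a different atom order.

```python
import sqlite3


def contains_fragment(smiles, fragment):
    """Naive stand-in: substring test on SMILES text, NOT a graph-based match."""
    return 1 if fragment in smiles else 0


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called inside SQL statements.
conn.create_function("CONTAINS_FRAGMENT", 2, contains_fragment)

conn.execute("CREATE TABLE chemtable (name TEXT, smilescol TEXT)")
conn.executemany(
    "INSERT INTO chemtable VALUES (?, ?)",
    [("toluene", "Cc1ccccc1"), ("ethanol", "CCO"), ("phenol", "Oc1ccccc1")],
)

rows = conn.execute(
    "SELECT name FROM chemtable "
    "WHERE CONTAINS_FRAGMENT(smilescol, ?) ORDER BY name",
    ("c1ccccc1",),
).fetchall()
conn.close()
```

A real cartridge would replace the substring test with fingerprint screening followed by atom-by-atom matching, and would usually add an index type so the database does not have to scan every row.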


See also

* Biological database
* Beilstein database and Dortmund Data Bank
* BindingDB
* ChEBI
* ChEMBL
* Chemisches Zentralblatt Structural Database
* ChemSpider
* Collaborative Drug Discovery
* Comparative Toxicogenomics Database
* Computational Chemistry List
* DrugBank
* List of chemical databases
* List of software for molecular mechanics modeling
* LOLI Database
* NMR spectra database
* PubChem
* SPRESI database
* Colocalization Benchmark Source

