Substructure Search

Substructure search (SSS) is a method to retrieve from a database only those chemicals matching a pattern of atoms and bonds which a user specifies. It is an application of graph theory, specifically subgraph matching in which the query is a hydrogen-depleted molecular graph. The mathematical foundations for the method were laid in the 1870s, when it was suggested that chemical structure drawings were equivalent to graphs with atoms as vertices and bonds as edges. SSS is now a standard part of cheminformatics and is widely used by pharmaceutical chemists in drug discovery. There are many commercial systems that provide SSS, typically having a graphical user interface and chemical drawing software. Large publicly-available databases like PubChem and ChemSpider can be searched this way, as can Wikipedia's articles describing individual chemicals.


Definitions

Substructure search is used to retrieve from a database of chemicals those which contain the pattern of atoms and bonds specified by a user. It is implemented using a specialist type of query language, and in real-world applications the search may be further constrained using logical operators on additional data held in the database, for example "return all carboxylic acids where a sample of >1 g is available". One definition of "substructure" was provided in 2008: "given two chemical structures A and B, if structure A is fully contained in structure B, then A is a substructure of B, while B is a superstructure of A." In this definition, the word "structure" is not synonymous with "compound". If it were, the structure for ethanol, CH3CH2OH, would not be a substructure of propanol, CH3CH2CH2OH, since the terminal CH3 of ethanol is not "fully contained" in the propanol chain: the carbon two atoms away from the OH group in propanol is a CH2, not a CH3. Instead the query structure is, formally, a hydrogen-depleted molecular graph. The search is thus for substances which contain three atoms and two single bonds connected as C–C–O. Propanol is a "hit", as is diethyl ether, with C–C–O–C–C. If a user wished to limit the hits to alcohols, then the query structure would have to be drawn with an "explicit hydrogen", as C–C–O–H, and ether would no longer match. In mathematical terms, finding substructures is an application of graph theory, specifically subgraph matching.
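
The C–C–O example can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database implements the search: the `Molecule` class, the `is_substructure` function and the convention of storing a hydrogen count on each heavy atom are all assumptions made for this example, and bond orders are ignored (every bond is treated as single). A query atom's hydrogen count is read as a minimum, so a count of 0 means "hydrogens unspecified" and a count of 1 on oxygen plays the role of the explicit hydrogen in C–C–O–H.

```python
from typing import Dict, List, Set, Tuple

Atom = Tuple[str, int]   # (element symbol, number of attached hydrogens)
Bond = Tuple[int, int]   # indices of the two bonded heavy atoms


class Molecule:
    """Hydrogen-depleted molecular graph (hypothetical helper class)."""

    def __init__(self, atoms: List[Atom], bonds: List[Bond]) -> None:
        self.atoms = atoms
        self.adj: Dict[int, Set[int]] = {i: set() for i in range(len(atoms))}
        for a, b in bonds:
            self.adj[a].add(b)
            self.adj[b].add(a)


def is_substructure(query: Molecule, target: Molecule) -> bool:
    """Backtracking search for a one-to-one mapping of query atoms onto target atoms."""

    def compatible(q: int, t: int, mapping: Dict[int, int]) -> bool:
        q_elem, q_min_h = query.atoms[q]
        t_elem, t_h = target.atoms[t]
        if q_elem != t_elem or t_h < q_min_h:
            return False
        # Every query bond to an already-mapped atom must exist in the target.
        return all(mapping[nb] in target.adj[t]
                   for nb in query.adj[q] if nb in mapping)

    def extend(q: int, mapping: Dict[int, int]) -> bool:
        if q == len(query.atoms):
            return True          # all query atoms have been placed
        used = set(mapping.values())
        for t in range(len(target.atoms)):
            if t not in used and compatible(q, t, mapping):
                mapping[q] = t
                if extend(q + 1, mapping):
                    return True
                del mapping[q]   # undo and try the next candidate
        return False

    return extend(0, {})


# Targets: CH3-CH2-CH2-OH and CH3-CH2-O-CH2-CH3
propanol = Molecule([("C", 3), ("C", 2), ("C", 2), ("O", 1)],
                    [(0, 1), (1, 2), (2, 3)])
diethyl_ether = Molecule([("C", 3), ("C", 2), ("O", 0), ("C", 2), ("C", 3)],
                         [(0, 1), (1, 2), (2, 3), (3, 4)])

# Queries: C-C-O with hydrogens unspecified, and C-C-O-H
query_cco = Molecule([("C", 0), ("C", 0), ("O", 0)], [(0, 1), (1, 2)])
query_ccoh = Molecule([("C", 0), ("C", 0), ("O", 1)], [(0, 1), (1, 2)])

for name, mol in [("propanol", propanol), ("diethyl ether", diethyl_ether)]:
    print(name, is_substructure(query_cco, mol), is_substructure(query_ccoh, mol))
```

With these inputs both molecules match the plain C–C–O query, while only propanol matches once the hydrogen on oxygen is required, mirroring the behaviour described in the text.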


Examples

Standard conventions used when chemists draw chemical structures need to be considered when implementing substructure search. Historically, the representation of tautomer forms and stereochemistry has posed difficulties. This can be illustrated using histidine. In the accompanying figure, the top row shows the standard two-dimensional chemical drawing for (S)-histidine (the natural isomer of this amino acid), its enantiomer (R)-histidine and a drawing which conventionally indicates the racemic mixture of equal amounts of the R and S forms. The bottom row shows the same three compounds with the imidazole ring drawn in its alternative tautomer form. For histidine, it has been experimentally determined by 15N NMR spectroscopy that the 1-H tautomer is preferred over the 3-H form in samples.

Choice of representation for storage in a database can influence substructure searches. All six drawings are hits for a propanol substructure C–C–C–O, as shown in red. However, only the top row would, apparently, be a hit for the blue substructure of 1-H imidazole-4-methyl, as this is not "fully contained" in the other three compounds. In fact, each vertical pair is the same chemical substance: tautomers in general cannot be isolated as separate samples. In modern databases, substances are held in a single canonical form, with checks made for uniqueness. The InChIKey provides one way to do this. (S)-Histidine's standard key is HNDVDQJCIGZPNO-YFKPBYRVSA-N, (R)-histidine's key is HNDVDQJCIGZPNO-RXMQYKEDSA-N and (RS)-histidine's is HNDVDQJCIGZPNO-UHFFFAOYSA-N. The first block of 14 letters is identical for all these substances, as it encodes the molecular graph.
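
As a small illustration of how the first block of the key can be used, the sketch below groups records by that block. The three keys are the ones quoted above; the function names `first_block` and `group_by_first_block` are hypothetical and not part of any InChI software.

```python
from collections import defaultdict
from typing import Dict, List

# InChIKeys quoted in the text for the three histidine records.
INCHIKEYS: Dict[str, str] = {
    "(S)-histidine": "HNDVDQJCIGZPNO-YFKPBYRVSA-N",
    "(R)-histidine": "HNDVDQJCIGZPNO-RXMQYKEDSA-N",
    "(RS)-histidine": "HNDVDQJCIGZPNO-UHFFFAOYSA-N",
}


def first_block(inchikey: str) -> str:
    """Return the part of the key before the first hyphen."""
    return inchikey.split("-")[0]


def group_by_first_block(keys: Dict[str, str]) -> Dict[str, List[str]]:
    """Collect record names that share the same first block."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, key in keys.items():
        groups[first_block(key)].append(name)
    return dict(groups)


groups = group_by_first_block(INCHIKEYS)
print(groups)
```

All three records fall into a single group because they share the same molecular graph; only the later blocks of the key distinguish the stereoisomers.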


Query interfaces and search algorithms

Most substructure search systems present the user with a graphical user interface with a chemical structure drawing component. Query structures may contain bonding patterns such as "single/aromatic" or "any" to provide flexibility. Similarly, the vertices which in an actual compound would be a specific atom may be replaced with an atom list in the query. Cis–trans isomerism at double bonds is catered for by giving a choice of retrieving only the E form, the Z form, or both. The algorithms for searching are computationally intensive, often of O(n^3) or O(n^4) time complexity (where n is the number of atoms involved), although the underlying problem is known to be NP-complete. Speedups are achieved using fragment screening as a first step. This pre-computation typically involves creation of bitstrings representing presence or absence of molecular fragments. Target compounds that do not possess the fragments present in the query cannot be hits and are eliminated. Atom-by-atom searching, in which a mapping of the query's atoms and bonds onto the target molecule is sought, is usually done with a variant of the Ullmann algorithm.
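
The fragment-screening step can be sketched as follows. This is an illustration under stated assumptions rather than a description of any real system: the six-entry fragment dictionary, the tiny `DATABASE` and the function names are invented for the example, and real systems use far larger fragment sets. Each molecule is reduced to an integer whose bits record which fragments it contains; a target survives the screen only if every bit set in the query is also set in the target.

```python
from typing import Dict, List, Tuple

Bonds = List[Tuple[int, int]]

# Hypothetical fragment dictionary: one bit per element or bonded element pair.
FRAGMENTS = ["C", "N", "O", "C-C", "C-N", "C-O"]
BIT = {frag: 1 << i for i, frag in enumerate(FRAGMENTS)}


def fingerprint(atoms: List[str], bonds: Bonds) -> int:
    """Build a bitstring (as an int) marking which fragments are present."""
    bits = 0
    for elem in atoms:
        bits |= BIT.get(elem, 0)
    for a, b in bonds:
        key = "-".join(sorted((atoms[a], atoms[b])))
        bits |= BIT.get(key, 0)
    return bits


def passes_screen(query_fp: int, target_fp: int) -> bool:
    """A target can only be a hit if it has every fragment the query has."""
    return (query_fp & target_fp) == query_fp


# Hydrogen-depleted graphs: element list plus bonds between atom indices.
DATABASE: Dict[str, Tuple[List[str], Bonds]] = {
    "propanol": (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)]),
    "diethyl ether": (["C", "C", "O", "C", "C"], [(0, 1), (1, 2), (2, 3), (3, 4)]),
    "ethylamine": (["C", "C", "N"], [(0, 1), (1, 2)]),
    "methanol": (["C", "O"], [(0, 1)]),
}

query_fp = fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])  # the C-C-O query

# Fingerprints would normally be pre-computed once and stored with each record.
candidates = [name for name, (atoms, bonds) in DATABASE.items()
              if passes_screen(query_fp, fingerprint(atoms, bonds))]
print(candidates)
```

The screen only rules targets out. Anything that passes is a candidate and still has to be confirmed by atom-by-atom matching, because a molecule can contain all the required fragments without having them connected in the way the query demands.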


Implementations

Substructure search is a standard feature in chemical databases accessible via the web. Large databases such as PubChem, maintained by the National Center for Biotechnology Information, and ChemSpider, maintained by the Royal Society of Chemistry, have graphical interfaces for search. The Chemical Abstracts Service, a division of the American Chemical Society, provides tools to search the chemical literature, and Reaxys, supplied by Elsevier, covers both chemicals and reaction information, including that originally held in the Beilstein database. PATENTSCOPE, maintained by the World Intellectual Property Organization, makes chemical patents accessible by substructure, and Wikipedia's articles describing individual chemicals can also be searched that way. Suppliers of chemicals as synthesis intermediates or for high-throughput screening routinely provide search interfaces. Currently, the largest database that can be freely searched by the public is the ZINC database, which is claimed to contain over 37 billion commercially available molecules.


History

The idea that chemical structures as depicted using drawings of the type introduced by Kekulé were related to what is now called graph theory was suggested by the mathematician J. J. Sylvester in 1878. He was the first to use the word "graph" in the sense of a network. Arthur Cayley had already, in 1874, considered how to enumerate chemical isomers, in what was an early approach to molecular graphs, where atoms are at vertices and bonds correspond to edges.

In the 20th century, chemists developed standard ways to show structural formulae, especially for individual organic compounds that were increasingly being synthesized and tested as potential drugs or agrochemicals. By the 1950s, as the number of compounds made and tested grew, the first attempts to create chemical databases were made and the sub-discipline of cheminformatics was established. As stated in 2012, "searching for substructures in molecules belongs to the most elementary tasks in cheminformatics and is nowadays part of virtually every cheminformatics software".

The first suggested use for substructure search was in 1957, to reduce the workload of patent examiners. They have to search published literature to decide whether an invention is novel, which for chemical patents often means finding known examples within the generic claims of a Markush structure. Before this could become a reality, a number of developments were required. Importantly, the existing literature had to be made searchable and a way to input a chemical structure query and return the matching results had to be devised.

These requirements had been partially met as early as 1881 when Friedrich Konrad Beilstein introduced the Handbuch der organischen Chemie (Handbook of Organic Chemistry), which carefully classified known chemicals in a very systematic manner so that all examples containing a given heterocycle would be located together. In 1907, the American Chemical Society set up the Chemical Abstracts Service (CAS). This weekly subscription service included a printed publication with summaries of articles in thousands of scholarly journals and claims in worldwide patents. This had a chemical substance index that, in principle, allowed searching by chemical name or formula. However, it was only when the CAS records had been fully converted into machine-readable form and the internet was available to connect its database to end-users that comprehensive searching became possible. CAS provided various specialist search services from the 1980s but it was not until 2008 that its "SciFinder" system became available via the web.

By the 1960s, companies synthesizing and testing new chemicals made significant progress in creating in-house databases. Imperial Chemical Industries stored chemical structures encoded as text strings, using Wiswesser line notation. Its associated CROSSBOW software allowed substructure search using key-based searches followed by more processor-intensive atom-by-atom search. It was recognised that research chemists wanted not only to search company collections for existing inventory but also to search third-party databases supplied by vendors of small-molecule intermediates. The latter application evolved as a collaboration involving six companies with pharmaceutical interests and their commercial suppliers.

By the 1980s, other line notations were used for commercially-available substructure search systems. SMILES encoding, together with its SMARTS query language, and SYBYL line notation are examples. A comprehensive survey of then-available chemical information systems was produced for NASA in 1985. The need to combine chemistry search with biological data produced by screening compounds at ever-larger scales led to implementation of systems such as MACCS. This commercial system from MDL Information Systems made use of an algorithm specifically designed for storage and search within groups of chemicals that differed only in their stereochemistry. A review of the many systems available by the mid-1980s pointed out that "most in-house developed systems have been replaced with commercially available standardised software for managing chemical structure databases." The MDL Molfile is now an open file format for storing single-molecule data in the form of a connection table.

By the 2000s, personal computers had become powerful enough that storage and search of chemistry within office software such as Microsoft Excel was possible. Subsequent developments involved the use of new techniques to allow efficient searches over very large databases and, importantly, the use of a standardised International Chemical Identifier, a type of line notation, to uniquely define a chemical substance.


See also

* Molecule mining


External links


* Wikipedia Chemical Structure Explorer, to search Wikipedia chemistry articles by substructure
* Search PubChem
* Search ChemSpider
* Search ZINC-22, a database of over 50 billion molecules