International Chemical Identifier

The International Chemical Identifier (InChI) is a textual identifier for chemical substances, designed to provide a standard way to encode molecular information and to facilitate the search for such information in databases and on the web. It was initially developed by the International Union of Pure and Applied Chemistry (IUPAC) and the National Institute of Standards and Technology (NIST) from 2000 to 2005, and the format and algorithms are non-proprietary. Since May 2009, it has been developed by the InChI Trust, a nonprofit charity from the United Kingdom which works to implement and promote the use of InChI.

The identifiers describe chemical substances in terms of layers of information: the atoms and their bond connectivity, tautomeric information, isotope information, stereochemistry, and electronic charge information. Not all layers have to be provided; for instance, the tautomer layer can be omitted if that type of information is not relevant to the particular application. The InChI algorithm converts input structural information into a unique InChI identifier in a three-step process: normalization (to remove redundant information), canonicalization (to generate a unique number label for each atom), and serialization (to give a string of characters).

InChIs differ from the widely used CAS registry numbers in three respects: firstly, they are freely usable and non-proprietary; secondly, they can be computed from structural information and do not have to be assigned by some organization; and thirdly, most of the information in an InChI is human readable (with practice). InChIs can thus be seen as akin to a general and extremely formalized version of IUPAC names. They can express more information than the simpler SMILES notation and differ in that every structure has a unique InChI string, which is important in database applications. Information about the 3-dimensional coordinates of atoms is not represented in InChI; for this purpose a format such as PDB can be used.

The InChIKey, sometimes referred to as a hashed InChI, is a fixed-length (27-character) condensed digital representation of the InChI that is not human-understandable. The InChIKey specification was released in September 2007 in order to facilitate web searches for chemical compounds, since these were problematic with the full-length InChI. Unlike the InChI, the InChIKey is not unique: collisions can be calculated to be very rare, but they do happen.

In January 2009 version 1.02 of the InChI software was released. This provided a means to generate the so-called standard InChI, which does not allow for user-selectable options in dealing with the stereochemistry and tautomeric layers of the InChI string. The standard InChIKey is then the hashed version of the standard InChI string. The standard InChI is intended to simplify comparison of InChI strings and keys generated by different groups and subsequently accessed via diverse sources such as databases and web resources.

The continuing development of the standard has been supported since 2010 by the not-for-profit InChI Trust, of which IUPAC is a member. The current software version is 1.06 and was released in December 2020. Prior to 1.04, the software was freely available under the open-source LGPL license, but it now uses a custom license called the IUPAC-InChI Trust License.


Generation

To avoid generating different InChIs for tautomeric structures, an input chemical structure is normalized before the InChI is generated, reducing it to its so-called core parent structure. This may involve changing bond orders, rearranging formal charges and possibly adding and removing protons. Different input structures may give the same result; for example, acetic acid and acetate would both give the same core parent structure, that of acetic acid.

A core parent structure may be disconnected, consisting of more than one component, in which case the sublayers in the InChI usually consist of sublayers for each component, separated by semicolons (periods for the chemical formula sublayer). One way this can happen is that all metal atoms are disconnected during normalization; so, for example, the InChI for tetraethyllead will have five components, one for lead and four for the ethyl groups.

The first, main, layer of the InChI refers to this core parent structure, giving its chemical formula, non-hydrogen connectivity without bond order (/c sublayer) and hydrogen connectivity (/h sublayer). The /q portion of the charge layer gives its charge, and the /p portion of the charge layer tells how many protons (hydrogen ions) must be added to or removed from it to regenerate the original structure. If present, the stereochemical layer, with sublayers /b, /t, /m and /s, gives stereochemical information, and the isotopic layer /i (which may contain sublayers /h, /b, /t, /m and /s) gives isotopic information. These are the only layers which can occur in a standard InChI.

If the user wants to specify an exact tautomer, a fixed hydrogen layer /f can be appended, which may contain various additional sublayers. This cannot be done in standard InChI, so different tautomers will have the same standard InChI (for example, alanine will give the same standard InChI whether input in a neutral or a zwitterionic form). Finally, a nonstandard reconnected /r layer can be added, which effectively gives a new InChI generated without breaking bonds to metal atoms. This may contain various sublayers, including /f.
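The role of the /p sublayer can be illustrated with a minimal sketch. It assumes the two standard InChI strings written in the code (they are not quoted in the text above), and the helper names `proton_balance` and `core_parent_inchi` are hypothetical. It only covers simple standard InChIs that carry a single /p sublayer.

```python
import re

# Standard InChIs assumed for illustration; they are not quoted in the article.
ACETIC_ACID = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
ACETATE = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1"


def proton_balance(inchi: str) -> int:
    """Return the /p value: protons added (+) to or removed (-) from the core parent."""
    match = re.search(r"/p([+-]\d+)", inchi)
    return int(match.group(1)) if match else 0


def core_parent_inchi(inchi: str) -> str:
    """Hypothetical helper: drop the /p sublayer, leaving the core parent's layers."""
    return re.sub(r"/p[+-]\d+", "", inchi)


print(proton_balance(ACETATE))
print(core_parent_inchi(ACETATE) == ACETIC_ACID)
```

Under these assumptions, acetate differs from acetic acid only by a /p sublayer recording one removed proton, which is what sharing a core parent structure means in practice.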


Format and layers

Every InChI starts with the string "InChI=" followed by the version number, currently 1. In a standard InChI, the version number is followed by the letter S; standard InChI is a fully standardized InChI flavor maintaining the same level of attention to structure details and the same conventions for drawing perception. The remaining information is structured as a sequence of layers and sub-layers, with each layer providing one specific type of information. The layers and sub-layers are separated by the delimiter "/" and start with a characteristic prefix letter (except for the chemical formula sub-layer of the main layer). The six layers with important sublayers are:

1. Main layer
   * Chemical formula (no prefix). This is the only sublayer that must occur in every InChI. Numbers used throughout the InChI are given in the formula's element order excluding hydrogen atoms. For example, "/C10H16N5O13P3" implies that atoms numbered 1–10 are carbons, 11–15 are nitrogens, 16–28 are oxygens, and 29–31 are phosphorus.
   * Atom connections (prefix: "c"). The atoms in the chemical formula (except for hydrogens) are numbered in sequence; this sublayer describes which atoms are connected by bonds to which other ones.
   * Hydrogen atoms (prefix: "h"). Describes how many hydrogen atoms are connected to each of the other atoms.
2. Charge layer
   * Charge sublayer (prefix: "q")
   * Proton sublayer (prefix: "p" for "protons")
3. Stereochemical layer
   * Double bonds and cumulenes (prefix: "b")
   * Tetrahedral stereochemistry of atoms and allenes (prefixes: "t", "m")
   * Type of stereochemistry information (prefix: "s")
4. Isotopic layer (prefixes: "i", "h", as well as "b", "t", "m", "s" for isotopic stereochemistry)
5. Fixed-H layer (prefix: "f"); contains some or all of the above types of layers except atom connections; may end with "o" sublayer; never included in standard InChI
6. Reconnected layer (prefix: "r"); contains the whole InChI of a structure with reconnected metal atoms; never included in standard InChI

The delimiter-prefix format has the advantage that a user can easily use a wildcard search to find identifiers that match only in certain layers.
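The delimiter-prefix structure can be sketched in a few lines of Python. This is an illustrative splitter, not the reference implementation: the function names `split_layers` and `main_layer` and the labels "version" and "formula" are this sketch's own conventions, and the second identifier is a hypothetical record that simply lacks the stereochemical layer.

```python
MORPHINE = (
    "InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9"
    "(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3"
    "/t10-,11+,13-,16-,17-/m0/s1"
)
# Hypothetical record: the same identifier without the stereochemical layer.
MORPHINE_NO_STEREO = MORPHINE.split("/t")[0]


def split_layers(inchi: str) -> list[tuple[str, str]]:
    """Split an InChI into (prefix, content) pairs in order of appearance.

    A list is used instead of a dict because prefixes such as "h" or "t"
    can recur in later layers (isotopic, fixed-H).
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    segments = inchi[len("InChI="):].split("/")
    layers = [("version", segments[0])]
    for index, segment in enumerate(segments[1:]):
        # The formula is the only sublayer without a lowercase prefix letter.
        if index == 0 and not segment[:1].islower():
            layers.append(("formula", segment))
        else:
            layers.append((segment[:1], segment[1:]))
    return layers


def main_layer(inchi: str) -> str:
    """Return the formula plus the leading /c and /h sublayers."""
    parts = []
    for prefix, content in split_layers(inchi)[1:]:
        if prefix == "formula":
            parts.append(content)
        elif prefix in ("c", "h"):
            parts.append(prefix + content)
        else:
            break  # first sublayer outside the main layer
    return "/".join(parts)


print([prefix for prefix, _ in split_layers(MORPHINE)])
print(main_layer(MORPHINE) == main_layer(MORPHINE_NO_STEREO))
```

Comparing only `main_layer` values plays the same role as a wildcard search over the trailing layers: two records match when formula and connectivity agree, regardless of what follows.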


InChIKey

The condensed, 27-character InChIKey is a hashed version of the full InChI (using the SHA-256 algorithm), designed to allow for easy web searches of chemical compounds. The standard InChIKey is the hashed counterpart of standard InChI. Most chemical structures on the Web up to 2007 were represented as GIF files, which are not searchable for chemical content. The full InChI turned out to be too lengthy for easy searching, and therefore the InChIKey was developed.

There is a very small but nonzero chance of two different molecules having the same InChIKey; the probability of duplication of only the first 14 characters has been estimated as one duplication in 75 databases each containing one billion unique structures. With all databases currently having below 50 million structures, such duplication appears unlikely at present. A recent study examined the collision rate more extensively and found that the experimental collision rate agrees with the theoretical expectations.

The InChIKey currently consists of three parts separated by hyphens, of 14, 10 and one character(s), respectively, like XXXXXXXXXXXXXX-YYYYYYYYFV-P. The first 14 characters result from a SHA-256 hash of the connectivity information (the main layer and /q sublayer of the charge layer) of the InChI. The second part consists of 8 characters resulting from a hash of the remaining layers of the InChI, a single character indicating the kind of InChIKey (S for standard and N for nonstandard), and a character indicating the version of InChI used (currently A for version 1). Finally, the single character at the end indicates the protonation of the core parent structure, corresponding to the /p sublayer of the charge layer (N for no protonation; O, P, ... if protons should be added; and M, L, ... if they should be removed).
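The three-part layout described above can be checked mechanically. The following is a minimal sketch with hypothetical names (`InChIKeyParts`, `parse_inchikey`); it assumes that the protonation letters continue alphabetically in steps of one proton on either side of N, which is how the "O, P, ..." and "M, L, ..." sequences read.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 letters + flag + version letter, hyphen, 1 letter.
_KEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    connectivity_hash: str  # first block, 14 characters
    remainder_hash: str     # first 8 characters of the second block
    is_standard: bool       # flag character: S (standard) or N (nonstandard)
    version_char: str       # "A" denotes InChI version 1
    proton_offset: int      # 0 for N; positive for O, P, ...; negative for M, L, ...


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its documented fields, validating the layout."""
    match = _KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    first, rest, flag, version, proton = match.groups()
    # Assumption: one alphabet step from "N" corresponds to one proton.
    return InChIKeyParts(first, rest, flag == "S", version, ord(proton) - ord("N"))


print(parse_inchikey("BQJCRHHNABKAKU-KBQPJGBKSA-N"))
```

Because the key is a hash, this parsing recovers only the bookkeeping fields (standard flag, version, protonation); the structure itself cannot be read back from the two hash blocks.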


Example

The standard InChI for morphine is

InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3/t10-,11+,13-,16-,17-/m0/s1

and the standard InChIKey for morphine is BQJCRHHNABKAKU-KBQPJGBKSA-N.
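The atom-numbering rule for the chemical formula sublayer (elements in formula order, hydrogens skipped) can be applied to the morphine formula C17H19NO3 with a short sketch. The function name `atom_number_ranges` is hypothetical, and the sketch handles only single-component formulas without a leading multiplier or "." separators.

```python
import re


def atom_number_ranges(formula: str) -> dict[str, tuple[int, int]]:
    """Map each non-hydrogen element to its (first, last) atom number."""
    ranges = {}
    next_number = 1
    for symbol, count_text in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == "H":
            continue  # hydrogens are not numbered
        count = int(count_text) if count_text else 1
        ranges[symbol] = (next_number, next_number + count - 1)
        next_number += count
    return ranges


print(atom_number_ranges("C17H19NO3"))
```

Read this way, the fragment "c1-18-7" at the start of morphine's /c sublayer says that carbon 1 is bonded to atom 18, the nitrogen, which in turn is bonded to carbon 7; atoms 19 to 21 are the three oxygens.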


InChI resolvers

As the InChI cannot be reconstructed from the InChIKey, an InChIKey always needs to be linked to the original InChI to get back to the original structure. InChI resolvers act as a lookup service to make these links, and prototype services are available from the National Cancer Institute, the UniChem service at the European Bioinformatics Institute, and PubChem. ChemSpider had a resolver until July 2015, when it was decommissioned.
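A resolver lookup amounts to a keyed web request. The sketch below targets PubChem's PUG REST interface; the URL pattern and the shape of the JSON reply reflect that service's public documentation as understood here and should be checked against it before use. The function names are hypothetical, and `resolve_inchikey` needs network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_lookup_url(inchikey: str) -> str:
    """Build the request URL asking PubChem for the InChI behind an InChIKey."""
    return f"{PUG_REST}/compound/inchikey/{quote(inchikey)}/property/InChI/JSON"


def resolve_inchikey(inchikey: str, timeout: float = 10.0) -> list[str]:
    """Return the InChI strings PubChem lists for this key (performs an HTTP request)."""
    with urlopen(pubchem_lookup_url(inchikey), timeout=timeout) as response:
        payload = json.load(response)
    # Assumed reply layout: {"PropertyTable": {"Properties": [{"InChI": ...}, ...]}}
    return [row["InChI"] for row in payload["PropertyTable"]["Properties"]]


print(pubchem_lookup_url("BQJCRHHNABKAKU-KBQPJGBKSA-N"))
```

The function returns a list rather than a single string because, as noted for the InChIKey, a key is not guaranteed to correspond to exactly one structure.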


Name

The format was originally called IChI (IUPAC Chemical Identifier), then renamed in July 2004 to INChI (IUPAC-NIST Chemical Identifier), and renamed again in November 2004 to InChI (IUPAC International Chemical Identifier), a trademark of IUPAC.


Continuing development

Scientific direction of the InChI standard is carried out by the IUPAC Division VIII Subcommittee, and subgroups investigating and defining the expansion of the standard are funded by both IUPAC and the InChI Trust. The InChI Trust funds the development, testing and documentation of the InChI. Extensions are currently being defined to handle polymers and mixtures, Markush structures, reactions and organometallics; once accepted by the Division VIII Subcommittee, they will be added to the algorithm.


Software

The InChI Trust has developed software to generate the InChI, InChIKey and other identifiers. The release history of this software follows.


Adoption

The InChI has been adopted by many larger and smaller databases, including ChemSpider, ChEMBL, the Golm Metabolome Database, OpenPHACTS, and PubChem. However, adoption is not straightforward, and many databases show a discrepancy between the chemical structures and the InChI they contain, which is a problem for linking databases.


See also

* Molecular Query Language
* Simplified molecular-input line-entry system (SMILES)
* Molecule editor
* SYBYL Line Notation
* Bioclipse generates InChI and InChIKeys for drawn structures or opened files
* The Chemistry Development Kit uses JNI-InChI to generate InChIs, can convert InChIs into structures, and generate tautomers based on the InChI algorithms


External links


* IUPAC InChI site
* Description of the canonicalization algorithm
* Googling for InChIs, a presentation to the W3C
* InChI final version 1.02 and explanation of Standard InChI, January 2009
* NCI/CADD Chemical Identifier Resolver generates and resolves InChI/InChIKeys and many other chemical identifiers
* [...] that supports SMILES/SMARTS and InChI
* ChemSpider Compound APIs: ChemSpider REST API that allows generation of InChI and conversion of InChI to structure (also SMILES and generation of other properties)
* MarvinSketch from ChemAxon, implementation to draw structures (or open other file formats) and output to InChI file format
* BKchem implements its own InChI parser and uses the IUPAC implementation to generate InChI strings
* CompoundSearch implements an InChI and InChIKey search of spectral libraries
* SpectraBase implements an InChI and InChIKey search of spectral libraries
* JSME is a free JavaScript-based molecular editor that generates InChI and InChIKey in a web browser, which allows for easy web searches of chemical compounds