SMIRKS

The Simplified Molecular Input Line Entry System (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules. The original SMILES specification was initiated in the 1980s. It has since been modified and extended. In 2007, an open standard called OpenSMILES was developed in the open source chemistry community.


History

The original SMILES specification was initiated by David Weininger at the USEPA Mid-Continent Ecology Division Laboratory in Duluth in the 1980s. Acknowledged for their parts in the early development were "Gilman Veith and Rose Russo (USEPA) and Albert Leo and Corwin Hansch (Pomona College) for supporting the work, and Arthur Weininger (Pomona; Daylight CIS) and Jeremy Scofield (Cedar River Software, Renton, WA) for assistance in programming the system." The Environmental Protection Agency funded the initial project to develop SMILES. It has since been modified and extended by others, most notably by Daylight Chemical Information Systems. In 2007, an open standard called "OpenSMILES" was developed by the Blue Obelisk open-source chemistry community. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc).

In July 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being more human-readable than InChI; it also has a wide base of software support with extensive theoretical backing (such as graph theory).


Terminology

The term SMILES refers to a line notation for encoding molecular structures, and specific instances should strictly be called SMILES strings. However, the term SMILES is also commonly used to refer to both a single SMILES string and a number of SMILES strings; the exact meaning is usually apparent from the context. The terms "canonical" and "isomeric" can lead to some confusion when applied to SMILES. The terms describe different attributes of SMILES strings and are not mutually exclusive.

Typically, a number of equally valid SMILES strings can be written for a molecule. For example, CCO, OCC and C(O)C all specify the structure of ethanol. Algorithms have been developed to generate the same SMILES string for a given molecule; of the many possible strings, these algorithms choose only one. This SMILES is unique for each structure, although dependent on the canonicalization algorithm used to generate it, and is termed the canonical SMILES. These algorithms first convert the SMILES to an internal representation of the molecular structure; an algorithm then examines that structure and produces a unique SMILES string. Various algorithms for generating canonical SMILES have been developed and include those by Daylight Chemical Information Systems, OpenEye Scientific Software, MEDIT, Chemical Computing Group, MolSoft LLC, and the Chemistry Development Kit. A common application of canonical SMILES is indexing and ensuring uniqueness of molecules in a database.

The original paper that described the CANGEN algorithm claimed to generate unique SMILES strings for graphs representing molecules, but the algorithm fails for a number of simple cases (e.g. cuneane, 1,2-dicyclopropylethane) and cannot be considered a correct method for representing a graph canonically. There is currently no systematic comparison across commercial software to test if such flaws exist in those packages.

SMILES notation allows the specification of configuration at tetrahedral centers, and double bond geometry. These are structural features that cannot be specified by connectivity alone, and therefore SMILES which encode this information are termed isomeric SMILES. A notable feature of these rules is that they allow rigorous partial specification of chirality. The term isomeric SMILES is also applied to SMILES in which isomers are specified.
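The idea of picking one string out of many can be illustrated with a deliberately naive sketch. It is not CANGEN or any vendor's algorithm: it handles only acyclic molecules with single bonds, enumerates every way of writing the graph, and uses an arbitrary tie-break rule (shortest string, then alphabetical). The names `all_smiles` and `toy_canonical` are hypothetical.

```python
from itertools import permutations


def all_smiles(atoms, bonds):
    """Enumerate SMILES strings for an acyclic, single-bonded heavy-atom graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs
    """
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    def write(atom, parent):
        children = [n for n in adj[atom] if n != parent]
        if not children:
            return {atoms[atom]}
        results = set()
        for order in permutations(children):
            partials = {atoms[atom]}
            for k, child in enumerate(order):
                subs = write(child, atom)
                if k < len(order) - 1:
                    # every child except the last is written as a branch
                    partials = {p + "(" + s + ")" for p in partials for s in subs}
                else:
                    partials = {p + s for p in partials for s in subs}
            results |= partials
        return results

    out = set()
    for start in range(len(atoms)):
        out |= write(start, None)
    return out


def toy_canonical(atoms, bonds):
    """Pick one representative: shortest string, ties broken alphabetically."""
    return min(all_smiles(atoms, bonds), key=lambda s: (len(s), s))


ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
print(sorted(all_smiles(*ethanol)))  # ['C(C)O', 'C(O)C', 'CCO', 'OCC']
print(toy_canonical(*ethanol))       # CCO
```

Because the result depends only on the graph and not on how its atoms were numbered, the same molecule entered in a different atom order yields the same string. Brute-force enumeration grows very quickly with molecule size, which is one reason practical canonicalization algorithms rank atoms instead.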


Graph-based definition

In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph. The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes. Parentheses are used to indicate points of branching on the tree. The resultant SMILES form depends on the choices:
* of the bonds chosen to break cycles,
* of the starting atom used for the depth-first traversal, and
* of the order in which branches are listed when encountered.
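The procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the graph is hydrogen-suppressed, all bonds are single, there are fewer than ten ring closures, and the function name `write_smiles` is hypothetical. Branches are listed in the order neighbours appear in the input.

```python
def write_smiles(atoms, bonds, start=0):
    """Write a SMILES string by depth-first traversal of a molecular graph."""
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    visited = set()
    children = {i: [] for i in adj}      # spanning-tree edges
    ring_labels = {i: [] for i in adj}   # ring-closure digits per atom
    closed = set()                       # bonds broken to open cycles

    def explore(atom, parent):
        visited.add(atom)
        for nbr in adj[atom]:
            if nbr == parent:
                continue
            edge = frozenset((atom, nbr))
            if nbr not in visited:
                children[atom].append(nbr)
                explore(nbr, atom)
            elif edge not in closed:
                # bond back to an atom already on the tree: break the cycle
                closed.add(edge)
                label = len(closed)
                ring_labels[atom].append(label)
                ring_labels[nbr].append(label)

    def emit(atom):
        text = atoms[atom] + "".join(str(d) for d in ring_labels[atom])
        kids = children[atom]
        for kid in kids[:-1]:
            text += "(" + emit(kid) + ")"   # all but the last child are branches
        if kids:
            text += emit(kids[-1])
        return text

    explore(start, None)
    return emit(start)


cyclohexane = (["C"] * 6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
print(write_smiles(*cyclohexane))            # C1CCCCC1

propan_2_ol = (["C", "C", "O", "C"], [(0, 1), (1, 2), (1, 3)])
print(write_smiles(*propan_2_ol, start=0))   # CC(O)C
print(write_smiles(*propan_2_ol, start=2))   # OC(C)C
```

The last two calls show the dependence on the starting atom: the same graph produces two different, equally valid strings.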


SMILES definition as strings of a context-free language

From the viewpoint of formal language theory, a SMILES string is a word. A SMILES string is parsable with a context-free parser. This representation has been used in the prediction of biochemical properties (including toxicity and biodegradability) based on the main principle of chemoinformatics that similar molecules have similar properties. The predictive models implemented a syntactic pattern recognition approach (which involved defining a molecular distance) as well as a more robust scheme based on statistical pattern recognition.
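To make the "context-free" remark concrete, here is a rough recursive-descent recognizer for a simplified grammar, roughly chain := atom (bond? ring-label | bond? atom | "(" bond? chain ")")*. The grammar is an assumption made for illustration and is much smaller than the real SMILES syntax; in particular it does not check that ring labels are paired or that bracket-atom contents are well formed. The function name `accepts` is hypothetical.

```python
import re

ATOM = re.compile(r"\[[^\[\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
RING = re.compile(r"[0-9]|%[0-9][0-9]")
BOND = set("-=#$:/\\.")  # the dot (non-bond) is treated like a bond symbol here


def accepts(s):
    """Return True if s matches the simplified grammar described above."""
    pos = 0

    def atom():
        nonlocal pos
        m = ATOM.match(s, pos)
        if m is None:
            return False
        pos = m.end()
        return True

    def optional_bond():
        nonlocal pos
        if pos < len(s) and s[pos] in BOND:
            pos += 1

    def chain():
        nonlocal pos
        if not atom():
            return False
        while pos < len(s) and s[pos] != ")":
            if s[pos] == "(":
                pos += 1
                optional_bond()
                if not chain():              # recursion handles nested branches
                    return False
                if pos >= len(s) or s[pos] != ")":
                    return False
                pos += 1
            else:
                optional_bond()
                m = RING.match(s, pos)
                if m is not None:
                    pos = m.end()
                elif not atom():
                    return False
        return True

    return chain() and pos == len(s)


print(accepts("CCC(=O)O"))    # True
print(accepts("CCC=(O)O"))    # False: bond symbol outside the parentheses
print(accepts("C(C"))         # False: unbalanced parenthesis
```

The recursive call inside the branch case is what a regular expression alone cannot express: branches may nest to arbitrary depth.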


Description


Atoms

Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. Brackets may be omitted in the common case of atoms which:
1. are in the "organic subset" of B, C, N, O, P, S, F, Cl, Br, or I, and
2. have no formal charge, and
3. have the number of hydrogens attached implied by the SMILES valence model (typically their normal valence, but for N and P it is 3 or 5, and for S it is 2, 4 or 6), and
4. are the normal isotopes, and
5. are not chiral centers.

All other elements must be enclosed in brackets, and have charges and hydrogens shown explicitly. For instance, the SMILES for water may be written as either O or [OH2]. Hydrogen may also be written as a separate atom; water may also be written as [H]O[H].

When brackets are used, the symbol H is added if the atom in brackets is bonded to one or more hydrogen, followed by the number of hydrogen atoms if greater than 1, then by the sign + for a positive charge or by - for a negative charge. For example, [NH4+] for ammonium (NH4+). If there is more than one charge, it is normally written as a digit; however, it is also possible to repeat the sign as many times as the ion has charges: one may write either [Ti+4] or [Ti++++] for titanium(IV) Ti4+. Thus, the hydroxide anion (OH-) is represented by [OH-], the hydronium cation (H3O+) is [OH3+] and the cobalt(III) cation (Co3+) is either [Co+3] or [Co+++].
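The bracket-atom rules above can be captured in one regular expression. The following is a minimal sketch: the function name `parse_bracket_atom` and the returned dictionary layout are hypothetical, and the optional leading isotope number and `@`/`@@` chirality mark are included only so that such atoms are not rejected. Atom classes and other extensions are not handled.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z][a-z]?)"
    r"(?P<chiral>@{1,2})?"
    r"(?P<h>H\d*)?"
    r"(?P<charge>\+\d+|-\d+|\++|-+)?\]"
)


def parse_bracket_atom(text):
    """Split a bracket atom such as '[NH4+]' into its fields."""
    m = BRACKET_ATOM.fullmatch(text)
    if m is None:
        raise ValueError(f"not a bracket atom this sketch understands: {text!r}")

    h = m.group("h")
    hcount = 0 if h is None else int(h[1:] or 1)   # 'H' alone means one hydrogen

    c = m.group("charge")
    if c is None:
        charge = 0
    elif c[1:].isdigit():
        charge = int(c)                            # sign followed by a digit, e.g. '+4'
    else:
        charge = len(c) if c[0] == "+" else -len(c)  # repeated signs, e.g. '++++'

    isotope = m.group("isotope")
    return {
        "symbol": m.group("symbol"),
        "isotope": None if isotope is None else int(isotope),
        "hcount": hcount,
        "charge": charge,
    }


for example in ["[Au]", "[OH2]", "[NH4+]", "[OH-]", "[Ti+4]", "[Ti++++]", "[Co+++]"]:
    print(example, parse_bracket_atom(example))
```

Both spellings of a multiple charge give the same result, so `[Ti+4]` and `[Ti++++]` parse to identical fields.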


Bonds

A bond is represented using one of the symbols . - = # $ : / \. Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES string. Although single bonds may be written as -, this is usually omitted. For example, the SMILES for ethanol may be written as C-C-O, CC-O or C-CO, but is usually written CCO. Double, triple, and quadruple bonds are represented by the symbols =, #, and $ respectively, as illustrated by the SMILES O=C=O (carbon dioxide), C#N (hydrogen cyanide, HCN) and [Ga+]$[As-] (gallium arsenide).

An additional type of bond is a "non-bond", indicated with ., to indicate that two parts are not bonded together. For example, aqueous sodium chloride may be written as [Na+].[Cl-] to show the dissociation. An aromatic "one and a half" bond may be indicated with :; see below. Single bonds adjacent to double bonds may be represented using / or \ to indicate stereochemical configuration; see below.
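As a small illustration of the bond symbols, the sketch below lists the bonds in an unbranched, ring-free SMILES string, filling in the implied single bonds and treating the dot as "no bond". The helper name `linear_bonds` is hypothetical; branches, ring closures, and lower-case aromatic atoms are outside its scope, and the directional symbols / and \ are simply read as single bonds.

```python
import re

# Bond order for each explicit symbol; an omitted symbol means a single bond.
BOND_ORDER = {"-": 1, "=": 2, "#": 3, "$": 4, ":": 1.5}

# Group 1: an atom (bracket atom or organic-subset symbol). Group 2: a bond symbol.
TOKEN = re.compile(r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI])|([-=#$:.])")


def linear_bonds(smiles):
    """Return (atom, order, atom) triples for a chain-only SMILES string."""
    bonds = []
    prev = None       # previous atom token
    pending = None    # explicit bond symbol seen since the previous atom
    for m in TOKEN.finditer(smiles):
        atom, symbol = m.groups()
        if symbol:
            pending = symbol
            continue
        if prev is not None and pending != ".":
            bonds.append((prev, BOND_ORDER.get(pending, 1), atom))
        prev, pending = atom, None
    return bonds


print(linear_bonds("CCO"))           # [('C', 1, 'C'), ('C', 1, 'O')]
print(linear_bonds("C-C-O"))         # same bonds as CCO
print(linear_bonds("O=C=O"))         # [('O', 2, 'C'), ('C', 2, 'O')]
print(linear_bonds("C#N"))           # [('C', 3, 'N')]
print(linear_bonds("[Na+].[Cl-]"))   # []
```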


Rings

Ring structures are written by breaking each ring at an arbitrary point (although some choices will lead to a more legible SMILES than others) to make an acyclic structure and adding numerical ring closure labels to show connectivity between non-adjacent atoms. For example, cyclohexane and dioxane may be written as C1CCCCC1 and O1CCOCC1 respectively. For a second ring, the label will be 2. For example, decalin (decahydronaphthalene) may be written as C1CCCC2C1CCCC2.

SMILES does not require that ring numbers be used in any particular order, and permits ring number zero, although this is rarely used. Also, it is permitted to reuse ring numbers after the first ring has closed, although this usually makes formulae harder to read. For example, bicyclohexyl is usually written as C1CCCCC1C2CCCCC2, but it may also be written as C0CCCCC0C0CCCCC0.

Multiple digits after a single atom indicate multiple ring-closing bonds. For example, an alternative SMILES notation for decalin is C1CCCC2CCCCC12, where the final carbon participates in both ring-closing bonds 1 and 2. If two-digit ring numbers are required, the label is preceded by %, so C%12 is a single ring-closing bond of ring 12.

Either or both of the digits may be preceded by a bond type to indicate the type of the ring-closing bond. For example, cyclopropene is usually written C1=CC1, but if the double bond is chosen as the ring-closing bond, it may be written as C=1CC1, C1CC=1, or C=1CC=1. (The first form is preferred.) C=1CC-1 is illegal, as it explicitly specifies conflicting types for the ring-closing bond.

Ring-closing bonds may not be used to denote multiple bonds. For example, C1C1 is not a valid alternative to C=C for ethylene. However, they may be used with non-bonds; C1.C2.C12 is a peculiar but legal alternative way to write propane, more commonly written CCC.

Choosing a ring-break point adjacent to attached groups can lead to a simpler SMILES form by avoiding branches. For example, cyclohexane-1,2-diol is most simply written as OC1CCCCC1O; choosing a different ring-break location produces a branched structure that requires parentheses to write.
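The pairing of ring-closure labels, including label reuse and the % prefix, can be sketched as follows. This is an illustrative fragment under simplifying assumptions: only unbracketed organic-subset atoms are recognized, bond symbols written before a digit are ignored, and the name `ring_closures` is hypothetical. Atoms are numbered from 0 in order of appearance.

```python
import re

# Group 1: atom symbol. Group 2: two-digit label after '%'. Group 3: single digit.
TOKEN = re.compile(r"(Br|Cl|[BCNOPSFI]|[bcnops])|%(\d\d)|(\d)")


def ring_closures(smiles):
    """Return the ring-closing bonds as (atom_index, atom_index) pairs."""
    open_rings = {}   # label -> index of the atom that opened it
    pairs = []
    atom_index = -1
    for m in TOKEN.finditer(smiles):
        atom, two_digit, one_digit = m.groups()
        if atom:
            atom_index += 1
            continue
        label = int(two_digit or one_digit)
        if label in open_rings:
            # second occurrence closes the ring and frees the label for reuse
            pairs.append((open_rings.pop(label), atom_index))
        else:
            open_rings[label] = atom_index
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return pairs


print(ring_closures("C1CCCCC1C2CCCCC2"))   # [(0, 5), (6, 11)]
print(ring_closures("C0CCCCC0C0CCCCC0"))   # [(0, 5), (6, 11)]
print(ring_closures("C1CCCC2CCCCC12"))     # [(0, 9), (4, 9)]
print(ring_closures("C%12CCCCC%12"))       # [(0, 5)]
```

The first two calls show that reusing label 0 for bicyclohexyl produces exactly the same ring bonds as using labels 1 and 2.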


Aromaticity

Aromatic rings such as benzene may be written in one of three forms:
1. In Kekulé form with alternating single and double bonds, e.g. C1=CC=CC=C1,
2. Using the aromatic bond symbol :, e.g. C:1:C:C:C:C:C1, or
3. Most commonly, by writing the constituent B, C, N, O, P and S atoms in lower-case forms b, c, n, o, p and s, respectively.

In the last case, bonds between two aromatic atoms are assumed (if not explicitly shown) to be aromatic bonds. Thus, benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1.

Aromatic nitrogen bonded to hydrogen, as found in pyrrole, must be represented as [nH]; thus imidazole is written in SMILES notation as n1c[nH]cc1.

When aromatic atoms are singly bonded to each other, such as in biphenyl, a single bond must be shown explicitly: c1ccccc1-c2ccccc2. This is one of the few cases where the single bond symbol - is required. (In fact, most SMILES software can correctly infer that the bond between the two rings cannot be aromatic and so will accept the nonstandard form c1ccccc1c2ccccc2.)

The Daylight and OpenEye algorithms for generating canonical SMILES differ in their treatment of aromaticity.


Branching

Branches are described with parentheses, as in CCC(=O)O for propionic acid and FC(F)F for fluoroform. The first atom within the parentheses, and the first atom after the parenthesized group, are both bonded to the same branch point atom. The bond symbol must appear inside the parentheses; outside (e.g. CCC=(O)O) is invalid.

Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N, which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.

Branches may be written in any order. For example, bromochlorodifluoromethane may be written as FC(Br)(Cl)F, BrC(F)(F)Cl, C(F)(Cl)(F)Br, or the like. Generally, a SMILES form is easiest to read if the simpler branch comes first, with the final, unparenthesized portion being the most complex. The only caveats to such rearrangements are:
* If ring numbers are reused, they are paired according to their order of appearance in the SMILES string. Some adjustments may be required to preserve the correct pairing.
* If stereochemistry is specified, adjustments must be made; see below.

The one form of branch which does not require parentheses is a ring-closing bond: the SMILES fragment C1N is equivalent to C(1)N, both denoting a bond between the C and the N. Choosing ring-closing bonds adjacent to branch points can reduce the number of parentheses required. For example, toluene is normally written as Cc1ccccc1 or c1ccccc1C, avoiding the parentheses required if written as c1cc(C)ccc1 or c1cc(ccc1)C.
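The branch rule (the first atom inside the parentheses and the first atom after them both attach to the branch point) maps naturally onto a stack. The sketch below is a simplified reader, not a full SMILES parser: bond symbols and ring-closure digits are skipped, so it records connectivity only and is meant for acyclic examples. The names `parse_branches` and `neighbours_of_carbon` are hypothetical.

```python
import re

ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def parse_branches(smiles):
    """Return (atoms, bonds) for an acyclic SMILES string, ignoring bond orders."""
    atoms, bonds = [], []
    stack = []     # branch points waiting for their ')' to be reached
    prev = None    # index of the atom the next atom will bond to
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            stack.append(prev)     # remember the branch point
            i += 1
        elif ch == ")":
            prev = stack.pop()     # return to the branch point
            i += 1
        else:
            m = ATOM.match(smiles, i)
            if m is None:
                i += 1             # bond symbol or digit: skipped in this sketch
                continue
            atoms.append(m.group())
            if prev is not None:
                bonds.append((prev, len(atoms) - 1))
            prev = len(atoms) - 1
            i = m.end()
    return atoms, bonds


def neighbours_of_carbon(smiles):
    """Sorted neighbour symbols of the first carbon atom in the string."""
    atoms, bonds = parse_branches(smiles)
    c = atoms.index("C")
    return sorted(atoms[b] if a == c else atoms[a] for a, b in bonds if c in (a, b))


print(parse_branches("CCC(=O)O"))
for text in ["FC(Br)(Cl)F", "BrC(F)(F)Cl", "C(F)(Cl)(F)Br"]:
    print(text, neighbours_of_carbon(text))   # ['Br', 'Cl', 'F', 'F'] each time
```

All three spellings of bromochlorodifluoromethane give the carbon the same four neighbours, which is the sense in which branch order does not matter when no stereochemistry is specified.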


Stereochemistry

SMILES permits, but does not require, specification of
stereoisomer In stereochemistry, stereoisomerism, or spatial isomerism, is a form of isomerism in which molecules have the same molecular formula and sequence of bonded atoms (constitution), but differ in the three-dimensional orientations of their atoms in ...
s. Configuration around double bonds is specified using the characters / and \ to show directional single bonds adjacent to a double bond. For example, F/C=C/F
see depiction
is one representation of ''
trans Trans- is a Latin prefix meaning "across", "beyond", or "on the other side of". Used alone, trans may refer to: Sociology * Trans, a sociological term which may refer to: ** Transgender, people who identify themselves with a gender that di ...
''-
1,2-difluoroethylene 1,2-Difluoroethylene, also known as 1,2-difluoroethene, is an organofluoride with the molecular formula CHF. It can exist as either of two geometric isomers, ''cis''-1,2-difluoroethylene or ''trans''-1,2-difluoroethylene. It is regarded as a haz ...
, in which the fluorine atoms are on opposite sides of the double bond (as shown in the figure), whereas F/C=C\F
see depiction
is one possible representation of '' cis''-1,2-difluoroethylene, in which the fluorines are on the same side of the double bond. Bond direction symbols always come in groups of at least two, of which the first is arbitrary. That is, F\C=C\F is the same as F/C=C/F. When alternating single-double bonds are present, the groups are larger than two, with the middle directional symbols being adjacent to two double bonds. For example, the common form of (2,4)-hexadiene is written C/C=C/C=C/C. As a more complex example, beta-carotene has a very long backbone of alternating single and double bonds, which may be written CC1CCC/C(C)=C1/C=C/C(C)=C/C=C/C(C)=C/C=C/C=C(C)/C=C/C=C(C)/C=C/C2=C(C)/CCCC2(C)C. Configuration at tetrahedral carbon is specified by @ or @@. Consider the four bonds in the order in which they appear, left to right, in the SMILES form. Looking toward the central carbon from the perspective of the first bond, the other three are either clockwise or counter-clockwise. These cases are indicated with @@ and @, respectively (because the @ symbol itself is a counter-clockwise spiral). For example, consider the
amino acid alanine. One of its SMILES forms is NC(C)C(=O)O, more fully written as N[CH](C)C(=O)O. L-Alanine, the more common enantiomer, is written as N[C@@H](C)C(=O)O. Looking from the nitrogen–carbon bond, the hydrogen (H), methyl (C), and carboxylate (C(=O)O) groups appear clockwise. D-Alanine can be written as N[C@H](C)C(=O)O.

While the order in which branches are specified in SMILES is normally unimportant, in this case it matters; swapping any two groups requires reversing the chirality indicator. If the branches are reversed so alanine is written as NC(C(=O)O)C, then the configuration also reverses; L-alanine is written as N[C@H](C(=O)O)C. Other ways of writing it include C[C@H](N)C(=O)O, OC(=O)[C@@H](N)C and OC(=O)[C@H](C)N.

Normally, the first of the four bonds appears to the left of the carbon atom, but if the SMILES is written beginning with the chiral carbon, such as C(C)(N)C(=O)O, then all four are to the right, and the first to appear (the bond to the hydrogen in this case) is used as the reference to order the following three: L-alanine may also be written [C@@H](C)(N)C(=O)O.

The SMILES specification includes elaborations on the @ symbol to indicate stereochemistry around more complex chiral centers, such as trigonal bipyramidal molecular geometry.
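The rule that swapping any two groups reverses the chirality indicator can be sketched in a few lines of Python. This is a minimal illustration and not part of any SMILES toolkit: the function name `retag_chirality` and the neighbour labels ("N", "H", "CH3", "COOH") are assumptions made for this example. The sketch counts the parity of the permutation between two neighbour orderings and flips @ and @@ when the parity is odd.

```python
def retag_chirality(tag, old_order, new_order):
    """Return the chirality tag ('@' or '@@') that is valid for new_order.

    tag       -- tag that is correct for old_order
    old_order -- unique neighbour labels in the order they appear in one SMILES
    new_order -- the same labels in the order they appear in another SMILES
    """
    if sorted(old_order) != sorted(new_order):
        raise ValueError("both orderings must contain the same neighbours")
    # Express new_order as positions in old_order, then count inversions.
    positions = [old_order.index(label) for label in new_order]
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    if inversions % 2 == 0:
        return tag  # even permutation: tag is unchanged
    return "@" if tag == "@@" else "@@"  # odd permutation: tag flips


# L-alanine as N[C@@H](C)C(=O)O has neighbours in the order N, H, CH3, COOH.
reference = ["N", "H", "CH3", "COOH"]
print(retag_chirality("@@", reference, ["N", "H", "COOH", "CH3"]))  # N[C@H](C(=O)O)C
print(retag_chirality("@@", reference, ["COOH", "H", "N", "CH3"]))  # OC(=O)[C@@H](N)C
```

The two calls print @ and @@ respectively, in agreement with the alanine forms listed above. The sketch handles only the ordering of neighbours around one tetrahedral centre; it does not parse SMILES.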


Isotopes

Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14cH]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.
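As a rough illustration of how the isotope prefix can be read back out of a SMILES string, the following Python sketch extracts the mass number and element symbol from each bracket atom. The pattern `BRACKET_ATOM` and the function `isotope_labels` are hypothetical names introduced here, and the regular expression covers only the start of a bracket atom (isotope and symbol), not the full SMILES grammar. The inputs are the bracketed forms [14cH]1ccccc1 (carbon-14 benzene) and [2H]C(Cl)(Cl)Cl (deuterochloroform).

```python
import re

# Optional isotope number followed by an element symbol.
# Aromatic atoms are written in lowercase (b, c, n, o, p, s, se, as).
BRACKET_ATOM = re.compile(r"\[(\d+)?([A-Z][a-z]?|se|as|[bcnops])")


def isotope_labels(smiles):
    """Return (mass_number, symbol) for each bracket atom that carries an isotope."""
    return [
        (int(mass), symbol)
        for mass, symbol in BRACKET_ATOM.findall(smiles)
        if mass  # bracket atoms without an isotope yield an empty string here
    ]


print(isotope_labels("[14cH]1ccccc1"))    # carbon-14 benzene
print(isotope_labels("[2H]C(Cl)(Cl)Cl"))  # deuterochloroform
```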


Examples

To illustrate a molecule with more than 9 rings, consider cephalostatin-1, a steroidal 13-ringed pyrazine with the empirical formula C54H74N2O10 isolated from the Indian Ocean hemichordate ''Cephalodiscus gilchristi''.

Starting with the left-most methyl group in the figure:

CC(C)(O1)C[C@@H](O)[C@@]1(O2)[C@@H](C)[C@@H]3CC=C4[C@]3(C2)C(=O)C[C@H]5[C@H]4CC[C@@H](C6)[C@]5(C)Cc(n7)c6nc(C[C@@]89(C))c7C[C@@H]8CC[C@@H]%10[C@@H]9C[C@@H](O)[C@@]%11(C)C%10=C[C@H](O%12)[C@]%11(O)[C@H](C)[C@]%12(O%13)[C@H](O)C[C@@]%13(C)CO

% appears in front of the index of ring closure labels above 9; see above.
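The handling of two-digit ring closure labels can be sketched as follows. This is a simplified scanner written for illustration, not a full SMILES parser: the function names `ring_closure_labels` and `count_ring_closures` are assumptions, bracket atoms are skipped wholesale so that isotope numbers and hydrogen counts are not mistaken for ring labels, and a % is assumed to be followed by exactly two digits.

```python
def ring_closure_labels(smiles):
    """Return ring closure labels in order of appearance, ignoring bracket atoms."""
    labels = []
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Digits inside brackets are isotopes, H counts or charges: skip them.
            i = smiles.index("]", i) + 1
        elif ch == "%":
            # Two-digit label, e.g. %10
            labels.append(int(smiles[i + 1:i + 3]))
            i += 3
        elif ch.isdigit():
            labels.append(int(ch))
            i += 1
        else:
            i += 1
    return labels


def count_ring_closures(smiles):
    """Pair up labels (first use opens, second use closes) and count the pairs."""
    open_labels = set()
    closed = 0
    for label in ring_closure_labels(smiles):
        if label in open_labels:
            open_labels.remove(label)
            closed += 1
        else:
            open_labels.add(label)
    if open_labels:
        raise ValueError(f"unclosed ring labels: {sorted(open_labels)}")
    return closed


print(ring_closure_labels("C%10CCCCC%10"))    # cyclohexane written with label 10
print(count_ring_closures("c1ccc2ccccc2c1"))  # naphthalene
```

Because a label may be reused once it has been closed, the number of closures can exceed the number of distinct labels in a string.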


Other examples of SMILES

The SMILES notation is described extensively in the SMILES theory manual provided by Daylight Chemical Information Systems and a number of illustrative examples are presented. Daylight's depict utility provides users with the means to check their own examples of SMILES and is a valuable educational tool.


Extensions

SMARTS is a line notation for specification of substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations which are searched for subgraph isomorphism.

SMIRKS, a superset of "reaction SMILES" and a subset of "reaction SMARTS", is a line notation for specifying reaction transforms. The general syntax for the reaction extensions is REACTANT>AGENT>PRODUCT (without spaces), where any of the fields can either be left blank or filled with multiple molecules separated by a dot (.), and other descriptions dependent on the base language. Atoms can additionally be identified with a number (e.g. [C:1]) for mapping.

SMILES corresponds to discrete molecular structures. However, many materials are macromolecules, which are too large (and often stochastic) to conveniently generate SMILES for. BigSMILES is an extension of SMILES that aims to provide an efficient representation system for macromolecules.
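The REACTANT>AGENT>PRODUCT layout of the reaction extensions described above can be taken apart with plain string operations. The sketch below is a minimal illustration under stated assumptions: `split_reaction` is a hypothetical helper, the example reactions were chosen for this sketch, and every dot is treated as a molecule separator, so a salt written as dot-separated ions would be split into its parts.

```python
def split_reaction(reaction_smiles):
    """Split 'REACTANT>AGENT>PRODUCT' into three lists of molecule SMILES."""
    fields = reaction_smiles.split(">")
    if len(fields) != 3:
        raise ValueError("expected exactly two '>' separators")
    # A blank field becomes an empty list; dots separate molecules in a field.
    return tuple([mol for mol in field.split(".") if mol] for field in fields)


# Bromination of ethylene, with the agent field left blank.
reactants, agents, products = split_reaction("C=C.BrBr>>BrCCBr")
print(reactants, agents, products)

# Acid-catalysed esterification, with the catalyst in the agent field.
print(split_reaction("CC(=O)O.OCC>[H+]>CC(=O)OCC.O"))
```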


Conversion

SMILES can be converted back to two-dimensional representations using structure diagram generation (SDG) algorithms. This conversion is sometimes ambiguous. Conversion to three-dimensional representation is achieved by energy-minimization approaches. There are many downloadable and web-based conversion utilities.


See also

* SMILES arbitrary target specification (SMARTS), an extension of SMILES for specification of substructural queries
* SYBYL Line Notation, another line notation
* International Chemical Identifier (InChI), IUPAC's alternative to SMILES
* Molecular Query Language, a query language that also allows numerical properties, e.g. physicochemical values or distances
* Chemistry Development Kit, 2D layout and conversion software
* OpenBabel, JOELib, OELib (conversion)

