Chemical file format

A chemical file format is a type of data file used specifically to represent molecular data. One of the most widely used is the chemical table file format, which is similar to ''Structure Data Format'' (SDF) files. These are text files that represent multiple chemical structure records and associated data fields. The XYZ file format is a simple format that usually gives the number of atoms in the first line, a comment on the second, followed by a number of lines with atomic symbols (or atomic numbers) and Cartesian coordinates. The Protein Data Bank Format is commonly used for proteins but is also used for other types of molecules. There are many other types, which are detailed below. Various software systems are available to convert from one format to another.


Distinguishing formats

Chemical information is usually provided as files or streams, and many formats have been created, with varying degrees of documentation. The format is indicated in three ways:
* ''file extension'' (usually 3 letters). This is widely used, but fragile, as common suffixes such as ".mol" and ".dat" are used by many systems, including non-chemical ones.
* ''self-describing files'' where the format information is included in the file. Examples are CIF and CML.
* ''chemical/MIME type'' added by a chemically-aware server.
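
The first and third mechanisms can be combined in a small lookup. The following is a minimal sketch, not a definitive registry: the function name `guess_chemical_type` is hypothetical, and the table is an illustrative subset of extension-to-type pairs assumed from the Chemical MIME conventions.

```python
from pathlib import Path

# Illustrative subset of extension -> chemical MIME type pairs.
# This table is an assumption for the example, not an authoritative registry.
CHEMICAL_MIME_TYPES = {
    ".cif": "chemical/x-cif",
    ".cml": "chemical/x-cml",
    ".mol": "chemical/x-mdl-molfile",
    ".sdf": "chemical/x-mdl-sdfile",
    ".pdb": "chemical/x-pdb",
    ".xyz": "chemical/x-xyz",
}


def guess_chemical_type(filename):
    """Return (mime_type, is_gzipped) guessed from the file name alone."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    is_gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if is_gzipped:
        suffixes = suffixes[:-1]  # look at the extension under the .gz
    extension = suffixes[-1] if suffixes else ""
    # Unknown or ambiguous suffixes such as ".dat" yield None.
    return CHEMICAL_MIME_TYPES.get(extension), is_gzipped
```

A name-based guess like this is only as reliable as the suffix, which is why self-describing formats and server-supplied MIME types remain useful.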


Chemical Markup Language

Chemical Markup Language (CML) is an open standard for representing molecular and other chemical data. The open source project includes an XML Schema, source code for parsing and working with CML data, and an active community. The articles ''Tools for Working with Chemical Markup Language'' and ''XML for Chemistry and Biosciences'' discuss CML in more detail. CML data files are accepted by many tools, including JChemPaint, Jmol, XDrawChem and MarvinView.


Protein Data Bank Format

The Protein Data Bank Format is commonly used for proteins, but it can be used for other types of molecules as well. It was originally designed as, and continues to be, a fixed-column-width format and thus officially has a built-in maximum number of atoms, of residues, and of chains; this resulted in splitting very large structures such as ribosomes into multiple files. For example, the E. coli 70S ribosome was represented as four PDB files in 2009 (3I1M, 3I1N, 3I1O and 3I1P); in 2014 they were consolidated into a single file, 4V6C. Many tools, however, can read files that exceed those limits.

Some PDB files contain an optional section describing atom connectivity as well as position. Because these files are sometimes used to describe macromolecular assemblies or molecules represented in explicit solvent, they can grow very large and are often compressed. Some tools, such as Jmol and KiNG, can read PDB files in gzipped format.

The wwPDB maintains the specifications of the PDB file format and its XML alternative, PDBML. There was a fairly major change in the PDB format specification (to version 3.0) in August 2007, together with a remediation of many file problems in the existing database. The typical file extension for a PDB file is ''.pdb'', although some older files use ''.ent'' or ''.brk''. Some molecular modeling tools write nonstandard PDB-style files that adapt the basic format to their own needs.
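
The fixed-column design, and the limits it implies, can be illustrated with a single ATOM record. This is a minimal sketch assuming the commonly documented column layout (serial in columns 7-11, coordinates in columns 31-54, element in columns 77-78); the function names are hypothetical, and occupancy and temperature factor are written as placeholder values.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z, element):
    """Write one ATOM record using fixed column positions."""
    if not 0 < serial <= 99999:
        # The serial number has only five columns (7-11), hence the atom limit.
        raise ValueError("atom serial does not fit in a five-column field")
    return (
        f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain:1s}{res_seq:4d}    "
        # Occupancy (1.00) and temperature factor (0.00) are placeholders here.
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2s}"
    )


def parse_atom_line(line):
    """Read fields back by column position, not by splitting on whitespace."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


# Round trip: what is written by column can be read back by column.
line = format_atom_line(1, "CA", "GLY", "A", 1, -12.345, 100.5, 3.0, "C")
atom = parse_atom_line(line)
```

Reading by column matters because adjacent fields may touch when values are wide, so splitting on whitespace can merge them.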


GROMACS format

The GROMACS file format family was created for use with the molecular simulation software package GROMACS. It closely resembles the PDB format but was designed for storing output from molecular dynamics simulations, so it allows for additional numerical precision and optionally retains information about particle velocity as well as position at a given point in the simulation trajectory. It does not allow for the storage of connectivity information, which in GROMACS is obtained from separate molecule and system topology files. The typical file extension for a GROMACS file is ''.gro''.
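
The optional velocity fields can be handled by checking the line length. The sketch below assumes the commonly documented ''.gro'' atom-line layout (four 5-character fields, then three 8-character positions in nm and, optionally, three 8-character velocities in nm/ps); `GRO_FORMAT` and `parse_gro_atom_line` are names chosen for this example.

```python
# Assumed layout of one .gro atom line: residue number, residue name,
# atom name, atom number, position (x, y, z), optional velocity (vx, vy, vz).
GRO_FORMAT = "%5d%-5s%5s%5d%8.3f%8.3f%8.3f%8.4f%8.4f%8.4f"


def parse_gro_atom_line(line):
    """Parse one atom line; velocity is None when the line stops at column 44."""
    record = {
        "res_number": int(line[0:5]),
        "res_name": line[5:10].strip(),
        "atom_name": line[10:15].strip(),
        "atom_number": int(line[15:20]),
        "position": tuple(float(line[20 + 8 * i:28 + 8 * i]) for i in range(3)),
        "velocity": None,
    }
    if len(line.rstrip()) >= 68:
        record["velocity"] = tuple(
            float(line[44 + 8 * i:52 + 8 * i]) for i in range(3)
        )
    return record


# Illustrative values only.
full_line = GRO_FORMAT % (1, "SOL", "OW", 1, 0.126, 1.624, 1.679,
                          0.1227, -0.0580, 0.0434)
with_velocity = parse_gro_atom_line(full_line)
without_velocity = parse_gro_atom_line(full_line[:44])
```

No bond information appears anywhere in the record, consistent with connectivity being kept in separate topology files.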


CHARMM format

The CHARMM molecular dynamics package can read and write a number of standard chemical and biochemical file formats; however, the CARD (coordinate) and PSF (protein structure file) formats are largely unique to CHARMM. The CARD format is fixed-column-width, resembles the PDB format, and is used exclusively for storing atomic coordinates. The PSF file contains atomic connectivity information (which describes atomic bonds) and is required before beginning a simulation. The typical file extensions used are ''.crd'' and ''.psf'' respectively.


GSD format

The General Simulation Data (GSD) file format was created for efficient reading and writing of generic particle simulations, primarily, but not exclusively, those from HOOMD-blue. The package also contains a Python module that reads and writes HOOMD schema GSD files with an easy-to-use syntax.


Ghemical file format

The Ghemical software can use OpenBabel to import and export a number of file formats. However, by default, it uses the GPR format. This file is composed of several parts, separated by a tag (!Header, !Info, !Atoms, !Bonds, !Coord, !PartialCharges and !End). The proposed MIME type for this format is ''application/x-ghemical''.


SYBYL Line Notation

SYBYL Line Notation (SLN) is a chemical line notation. Based on SMILES, it incorporates a complete syntax for specifying relative stereochemistry. SLN has a rich query syntax that allows for the specification of Markush structure queries. The syntax also supports the specification of combinatorial libraries of ChemDraw.


SMILES

The Simplified Molecular Input Line Entry System (SMILES) is a line notation for molecules. SMILES strings include connectivity but do not include 2D or 3D coordinates. Hydrogen atoms are usually implicit rather than written out. Other atoms are represented by their element symbols B, C, N, O, F, P, S, Cl, Br, and I. The symbol "=" represents double bonds and "#" represents triple bonds. Branching is indicated by parentheses, "()". Rings are indicated by pairs of matching digits. Some examples are CC(=O)O (acetic acid), C#N (hydrogen cyanide) and C1CCCCC1 (cyclohexane).
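
The symbols described above are enough to tokenize simple strings. The following is a rough sketch covering only that subset (no aromatic lower-case atoms, charges, or bracket atoms); `summarize_smiles` is a hypothetical helper, not part of any toolkit.

```python
import re

# Two-letter symbols must come first so that "Cl" is not read as "C" + "l".
TOKEN = re.compile(r"Cl|Br|[BCNOFPSI]|[=#()]|\d")


def summarize_smiles(smiles):
    """Count atoms, bonds, branches and ring closures in a simple SMILES string."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        # Anything outside the subset handled here is rejected, not guessed at.
        raise ValueError(f"unsupported characters in {smiles!r}")
    atoms = [t for t in tokens if t[0].isalpha()]
    ring_digits = [t for t in tokens if t.isdigit()]
    return {
        "heavy_atoms": len(atoms),
        "double_bonds": tokens.count("="),
        "triple_bonds": tokens.count("#"),
        "branches": tokens.count("("),
        "ring_closures": len(ring_digits) // 2,  # digits come in pairs
    }
```

For instance, `summarize_smiles("CC(=O)O")` reports four heavy atoms, one double bond and one branch; hydrogens are not counted because they are not written.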


XYZ

The XYZ file format is a simple format that usually gives the number of atoms in the first line, a comment on the second, followed by a number of lines with atomic symbols (or atomic numbers) and Cartesian coordinates. There is no formal standard, and several variations exist.
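
A reader for the typical layout is short. This is a minimal sketch assuming the common variant described above (atom count, comment line, then one "symbol x y z" line per atom); the function name `parse_xyz` and the water coordinates are illustrative only.

```python
def parse_xyz(text):
    """Parse a single-molecule XYZ block into (comment, [(symbol, x, y, z), ...])."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])                       # line 1: number of atoms
    comment = lines[1].strip() if len(lines) > 1 else ""  # line 2: free text
    atoms = []
    for line in lines[2:2 + n_atoms]:
        # Extra columns (used by some variants) are ignored.
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


# Illustrative geometry; the coordinates are not reference data.
WATER = """3
water molecule
O   0.000000   0.000000   0.117300
H   0.000000   0.757200  -0.469200
H   0.000000  -0.757200  -0.469200
"""

comment, atoms = parse_xyz(WATER)
```

Because the format carries no bonds, any connectivity has to be inferred from interatomic distances by the reading program.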


MDL number

The MDL number contains a unique identification number for each reaction and variation. The format is RXXXnnnnnnnn. R indicates a reaction, XXX indicates which database contains the reaction record. The numeric portion, nnnnnnnn, is an 8-digit number.
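
The RXXXnnnnnnnn pattern lends itself to a simple check. The sketch below assumes that the three-character database code consists of upper-case letters or digits, which the description above does not state; `parse_mdl_number` and the sample identifier are hypothetical.

```python
import re

# "R", a three-character database code, then exactly eight digits.
# The character set allowed for the database code is an assumption.
MDL_NUMBER = re.compile(r"R([A-Z0-9]{3})(\d{8})")


def parse_mdl_number(value):
    """Return (database_code, number) or None if the pattern does not match."""
    match = MDL_NUMBER.fullmatch(value)
    if match is None:
        return None
    database, number = match.groups()
    return database, int(number)


# "RABC00001234" is a made-up identifier used only to exercise the pattern.
parsed = parse_mdl_number("RABC00001234")
```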


Other common formats

Among the most widely used industry standards are the chemical table file formats, such as ''Structure Data Format'' (SDF) files. They are text files that adhere to a strict format for representing multiple chemical structure records and associated data fields. The format was originally developed and published by Molecular Design Limited (MDL). MOL is another file format from MDL. It is documented in Chapter 4 of ''CTfile Formats''.

PubChem also has XML and ASN1 file formats, which are export options from the PubChem online database. Both are text based as exported, although ASN1 is more commonly encountered as a binary format. A large number of other formats are listed in the table below.


Converting between formats

OpenBabel and JOELib are freely available open source tools specifically designed for converting between file formats. Their chemical expert systems support large atom type conversion tables.
:obabel -i ''input_format'' ''input_file'' -o ''output_format'' -O ''output_file''
For example, to convert the file epinephrine.sdf in SDF to CML use the command
:obabel -i sdf epinephrine.sdf -o cml -O epinephrine.cml
The resulting file is epinephrine.cml.
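
When conversions are scripted, the command can be assembled programmatically. The following is a sketch only: `build_obabel_command` is a hypothetical helper, it assumes the `obabel` executable is installed and on the PATH, and it assumes the `-O` flag names the output file as in current Open Babel releases.

```python
from pathlib import Path


def build_obabel_command(input_file, input_format, output_format):
    """Build the argument list for converting one file with obabel."""
    # The output name reuses the input name with the new format as extension.
    output_file = str(Path(input_file).with_suffix("." + output_format))
    return [
        "obabel",
        "-i", input_format, input_file,
        "-o", output_format,
        "-O", output_file,  # assumed output-file flag
    ]


command = build_obabel_command("epinephrine.sdf", "sdf", "cml")
# To run the conversion (requires Open Babel):
#     import subprocess
#     subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.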
IOData is a free and open-source Python library for parsing, storing, and converting various file formats commonly used by quantum chemistry, molecular dynamics, and plane-wave density-functional-theory software programs. It also supports a flexible framework for generating input files for various software packages. For a complete list of supported formats, see https://iodata.readthedocs.io/en/latest/formats.html.

A number of tools intended for viewing and editing molecular structures are able to read files in a number of formats and write them out in other formats. The tools JChemPaint (based on the Chemistry Development Kit), XDrawChem (based on OpenBabel), Chime, Jmol, Mol2mol and Discovery Studio fit into this category.


The Chemical MIME Project

"Chemical MIME" is a de facto approach for adding MIME types to chemical streams. This project started in January 1994, and was first announced during the Chemistry workshop at the First WWW International Conference, held at CERN in May 1994. ... The first version of an Internet draft was published during May–October 1994, and the second revised version during April–September 1995. A paper presented to the CPEP (Committee on Printed and Electronic Publications) at the IUPAC meeting in August 1996 is available for discussion. In 1998 the work was formally published in the JCIM.


Support

For Linux/Unix, configuration files are available as a "''chemical-mime-data''" package in .deb, RPM and tar.gz formats to register chemical MIME types on a web server. Programs can then register as viewer, editor or processor for these formats so that full support for chemical MIME types is available.


Sources of chemical data

Here is a short list of sources of freely available molecular data. Many more resources than those listed here are available on the Internet. Links to these sources are given in the references below.
# The US National Institutes of Health PubChem database is a huge source of chemical data. All of the data is two-dimensional. Data includes SDF, SMILES, PubChem XML, and PubChem ASN1 formats.
# The worldwide Protein Data Bank (wwPDB) is an excellent source of protein and nucleic acid molecular coordinate data. The data is three-dimensional and provided in Protein Data Bank (PDB) format.
# eMolecules is a commercial database for molecular data. The data includes a two-dimensional structure diagram and a SMILES string for each compound. eMolecules supports fast substructure searching based on parts of the molecular structure.
# ChemExper is a commercial database for molecular data. The search results include a two-dimensional structure diagram and a MOL file for many compounds.
# New York University Library of 3-D Molecular Structures.
# The US Environmental Protection Agency's Distributed Structure-Searchable Toxicity (DSSTox) Database Network is a project of EPA's Computational Toxicology Program. The database provides SDF molecular files with a focus on carcinogenic and otherwise toxic substances.


See also

* File format
* OpenBabel, JOELib, OELib
* Chemistry Development Kit
* Chemical Markup Language
* Software for molecular modeling
* NCI/CADD Chemical Identifier Resolver


References

