3.2. International Chemical Identifier (InChI) and InChIKey

  • InChI

        The IUPAC International Chemical Identifier (InChI)22-25 was originally developed by IUPAC, and its development is now continued by the InChI Trust25.  InChI is non-proprietary, open-source, and freely available to the scientific community.  In particular, because the software for generating InChI strings is also freely available, InChI avoids the interoperability problems that arise from differing implementations of the SMILES language.

        InChI encodes a chemical structure into “layers”.  Each layer holds a distinct and separable class of structural information, with the layers ordered to provide successive structural refinement.  There are currently six InChI layer types, each representing a different class of structural information: the main layer, a charge layer, a stereochemical layer, an isotopic layer, a fixed-H layer, and a reconnected layer.  The main layer, which specifies the chemical formula, the atoms, and the bonds between them, is required for all InChIs.  The other layers appear only when the corresponding input information is provided.  Layers and sublayers start with “/” (forward slash) followed by a letter denoting the identity of the layer (except for the chemical formula layer).  Below are some examples of InChI strings.

InChI=1S/CH4/h1H4 (methane)

InChI=1S/C2H6/c1-2/h1-2H3 (ethane)

InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 (ethanol)

InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1 (L-alanine)

These InChI strings are not easy for a human to read (especially compared to SMILES strings).  This is because InChI was developed as a “machine-readable” chemical identifier, with the aim of enabling a computer to regenerate the corresponding chemical structure from an InChI string generated by another computer.  For this reason, InChI is often called the bar code for chemical structures.
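The layer syntax described above (a "/" separator followed by a one-letter layer identifier) can be illustrated with a minimal Python sketch.  The function name split_inchi and the descriptions in LAYER_NAMES are illustrative assumptions, not part of the InChI software, and the sketch assumes the string contains a chemical formula layer; it does no chemical validation.

```python
# Informal descriptions of common layer prefix letters (illustrative, not exhaustive).
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protons",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopic",
    "f": "fixed-H",
    "r": "reconnected",
}


def split_inchi(inchi: str) -> dict:
    """Split an InChI string into its version tag, formula, and prefixed layers.

    Assumes the first segment after the version is the chemical formula,
    which is the one layer that carries no prefix letter.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len("InChI="):].split("/")
    # Keep a list of (letter, content) pairs because a letter can occur more than once.
    layers = [(segment[0], segment[1:]) for segment in rest if segment]
    return {"version": version, "formula": formula, "layers": layers}


alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
parts = split_inchi(alanine)
for letter, content in parts["layers"]:
    print(letter, LAYER_NAMES.get(letter, "unknown"), content)
```

For L-alanine this yields the connectivity and hydrogen sublayers of the main layer, followed by three stereochemical sublayers (t, m, and s), while methane yields only a hydrogen sublayer.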

        Because the layered structure of InChI allows one to represent a chemical structure at a desired level of detail, InChI software may generate different InChI strings for the same molecule, depending on the options chosen.  This flexibility may be regarded as an obstacle to standardization and interoperability.  In response to this concern, the standard InChI was introduced.  It is generated with standard option settings in the InChI software, so it always contains the same level of structural detail and follows the same conventions for drawing perception.  Standard InChI strings begin with “InChI=1S/”, while non-standard InChI strings begin with “InChI=1/”.  The digit “1” following “InChI=” is the current InChI version number.
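As a small illustration of the prefix convention, the following Python sketch reads the version number and the standard flag from the start of an InChI string.  The function name inchi_prefix is a hypothetical helper, and the regular expression covers only the two prefix forms mentioned in the text.

```python
import re

# "InChI=", a version number, an optional "S" for standard InChI, then "/".
_PREFIX = re.compile(r"^InChI=(\d+)(S?)/")


def inchi_prefix(inchi: str) -> tuple:
    """Return (version, is_standard) read from the InChI prefix."""
    match = _PREFIX.match(inchi)
    if match is None:
        raise ValueError("unrecognized InChI prefix")
    return int(match.group(1)), match.group(2) == "S"


print(inchi_prefix("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))  # standard InChI
print(inchi_prefix("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"))   # non-standard InChI
```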

  • InChIKey

        The length of an InChI string increases with the size of the corresponding chemical structure, and molecules with more than 100 atoms commonly produce very long InChI strings, which are poorly suited to internet search engines (such as Google, Yahoo, and Bing).  In addition, these search engines ignore letter case and the special characters used in InChI.  To address this issue, the InChIKey was introduced for internet and database searching and indexing.  It is a 27-character string derived from the InChI using a hashing algorithm.  Hashing is a one-way mathematical transformation typically used to calculate a compact, fixed-length digital representation of a much longer string of arbitrary length.
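The fixed-length property of hashing can be seen with a short Python sketch that uses SHA-256 from the standard library.  This is only an illustration of hashing in general and is not the InChIKey algorithm: the InChIKey additionally truncates the digest and re-encodes it as uppercase letters, which this sketch does not attempt.  The helper name digest_of is an assumption made for the example.

```python
import hashlib


def digest_of(text: str) -> str:
    """Return the SHA-256 digest of a string as hexadecimal text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


short_inchi = "InChI=1S/CH4/h1H4"
long_inchi = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

# Inputs of different lengths give digests of the same length.
print(len(short_inchi), len(digest_of(short_inchi)))
print(len(long_inchi), len(digest_of(long_inchi)))
```

Because the transformation is one-way, the original InChI cannot be computed back from the digest; it can only be recovered by looking the hash up in a database that stores both.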

        The InChIKey consists of three blocks, separated by hyphens, for example:

BSYNRYMUTXBXSQ-UHFFFAOYSA-N (aspirin)

HEFNNWSXXWATRW-UHFFFAOYSA-N (ibuprofen)

RZVAJINKPMORJF-UHFFFAOYSA-N (acetaminophen)

The first block of 14 characters (out of 27 characters in total, including the two hyphens) encodes the core molecular constitution, as described by the InChI main layer.  The other structural features (such as stereochemistry, isotopic substitution, exact position of hydrogens, and metal ligation data) are encoded in the second block.  The protonation or deprotonation state is encoded in the last InChIKey character, which forms the third block.
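The three-block layout can be checked and taken apart with a minimal Python sketch.  The names InChIKeyParts, parse_inchikey, and same_constitution are hypothetical helpers; the pattern assumes blocks of 14, 10, and 1 uppercase letters, which matches the 27-character examples above.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 10 letters, hyphen, 1 letter: 27 characters in total.
_KEY = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


class InChIKeyParts(NamedTuple):
    constitution: str  # first block: hash of the main layer
    details: str       # second block: the other structural features
    protonation: str   # third block: protonation/deprotonation indicator


def parse_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its three hyphen-separated blocks."""
    match = _KEY.match(key)
    if match is None:
        raise ValueError("not a well-formed InChIKey")
    return InChIKeyParts(*match.groups())


def same_constitution(key_a: str, key_b: str) -> bool:
    """True if two InChIKeys share the same first block."""
    return parse_inchikey(key_a).constitution == parse_inchikey(key_b).constitution


aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
ibuprofen = "HEFNNWSXXWATRW-UHFFFAOYSA-N"
print(parse_inchikey(aspirin))
print(same_constitution(aspirin, ibuprofen))
```

Comparing first blocks is a quick way to group records that share the same connectivity but may differ in stereochemistry or isotopic labeling.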

        Many databases, such as PubChem29, ChemSpider30, ChEBI31, and the NIST Chemistry WebBook32, accept InChI and InChIKey strings as queries to search for chemical structures.  InChIs and InChIKeys can also be used as queries in UniChem33 to produce cross-references between chemical structure identifiers from different databases.
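As one example of using an InChIKey as a database query, the Python sketch below builds a lookup URL for PubChem's PUG REST interface and, optionally, fetches the matching compound identifiers (CIDs).  The URL pattern reflects PubChem's documented PUG REST layout as understood at the time of writing and should be checked against the current documentation; the function names are hypothetical, and fetch_cids needs network access.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base address of PubChem's PUG REST service (assumed; verify against PubChem docs).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_cid_url(inchikey: str) -> str:
    """Build a URL that asks PubChem for the CIDs matching an InChIKey."""
    return f"{PUG_REST}/compound/inchikey/{quote(inchikey)}/cids/TXT"


def fetch_cids(inchikey: str, timeout: float = 10.0) -> list:
    """Query PubChem and return the matching CIDs as integers (requires network)."""
    with urlopen(pubchem_cid_url(inchikey), timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return [int(token) for token in text.split()]


print(pubchem_cid_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

Calling fetch_cids("BSYNRYMUTXBXSQ-UHFFFAOYSA-N") would perform the actual lookup; it is left uncalled here so that the sketch runs without a network connection.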

The following material was provided by Stephen Heller, Project Director of the InChI Trust.




The most detailed technical article on InChI is:
InChI, the IUPAC International Chemical Identifier
Stephen R Heller, Alan McNaught, Igor Pletnev, Stephen Stein,
Dmitrii Tchekhovskoi Journal of Cheminformatics 2015, 7:23 (30 May 2015)
 
 

 


Comments 2

Dr. Heller | Fri, 10/09/2015 - 09:57

I would like to add that there are more large databases, each containing up to about 100 million InChIs/InChIKeys; some of these (though not NCI) are not free to access:

NIH/NCI – 110 million
NIH/PubChem - 91 million (68 million online)
EBI UniChem – 91 million
RSC/ChemSpider – 34 million
Elsevier/Reaxys – 30 million

I would also like to mention that there are InChI for chemical reactions:

International chemical identifier for reactions (RInChI)

Guenter Grethe, Jonathan M Goodman and Chad HG Allen
J. Cheminf. 2013, 5:45, published online on 24 October 2013.

A number of freely available articles about InChI are in the open-access Journal of Cheminformatics:
http://www.jcheminf.com/search/results?terms=InChI

Please feel free to contact me with any questions or issues about InChI.

Steve Heller
