2.3 Chemical Representations on Computer: Part III

Sunghwan Kim, National Center for Biotechnology Information

Learning Objectives

  • Explain what SMILES, SMARTS and SMIRKS are.
  • Explain what InChI and InChIKey are.
  • Review SMILES specification rules.
  • Compare and contrast SMILES and InChI.
  • Demonstrate how to interpret SMILES, SMARTS, InChI strings into their corresponding chemical structures.

Line notations represent structures as a linear string of characters.  They are widely used in cheminformatics because computers can more easily process linear strings of data.  Examples of line notations include the Wiswesser Line-Formula Notation (WLN)1, Sybyl Line Notation (SLN)2,3, and Representation of Structure Diagram Arranged Linearly (ROSDAL)4,5.  Currently, the most widely used line notations are the Simplified Molecular-Input Line-Entry System (SMILES)6-9 and the IUPAC International Chemical Identifier (InChI)10-13, which are described below.

SMILES and related notation

SMILES

        The Simplified Molecular-Input Line-Entry System (SMILES)6-9 is a line notation for describing chemical structures using short ASCII strings.  SMILES was developed in the late 1980s and implemented by Daylight Chemical Information Systems (Santa Fe, NM), and it is still widely used today.  Detailed information on SMILES can be found in Chapter 3 of the Daylight Theory Manual14 as well as the SMILES tutorial15.

SMILES Specification Rules

In SMILES, atoms are represented by their atomic symbols.  The second letter of two-character atomic symbols must be entered in lower case.  Each non-hydrogen atom is specified independently by its atomic symbol enclosed in square brackets, [ ] (for example, [Au] or [Fe]).  Square brackets may be omitted for elements in the “organic subset” (B, C, N, O, P, S, F, Cl, Br, and I) if the proper number of “implicit” hydrogen atoms is assumed.  “Explicitly” attached hydrogens and formal charges are always specified inside brackets.  A formal charge is represented by one of the symbols + or -.  Single, double, triple, and aromatic bonds are represented by the symbols -, =, #, and :, respectively.  Single and aromatic bonds may be, and usually are, omitted.  Here are some examples of SMILES strings.

C               Methane (CH4)

CC             Ethane (CH3CH3)

C=C           Ethene (CH2=CH2)

C#C           Ethyne (HC≡CH)

COC           Dimethyl ether (CH3OCH3)

CCO           Ethanol (CH3CH2OH)

CC=O         Acetaldehyde (CH3-CH=O)

C#N           Hydrogen Cyanide (HCN)

[C-]#N       Cyanide anion
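The rules above can be illustrated by splitting a SMILES string into its tokens.  The following is only a minimal sketch in Python: the pattern `SMILES_TOKEN` and the function `tokenize` are hypothetical names introduced here, and the pattern covers a small subset of SMILES (for example, it does not handle the "." that separates disconnected parts), not the full Daylight grammar.

```python
import re

# Token pattern for a small subset of SMILES (an illustrative assumption,
# not the complete grammar from the Daylight Theory Manual).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [Au], [C-], [13CH4]
    r"|Cl|Br"          # two-letter atoms of the organic subset
    r"|[BCNOPSFI]"     # one-letter atoms of the organic subset
    r"|[bcnops]"       # aromatic atoms written in lower case
    r"|[-=#:/\\]"      # bond symbols (including directional bonds)
    r"|[()]"           # branch opening and closing
    r"|%\d{2}|\d"      # ring-closure labels
)


def tokenize(smiles):
    """Split a SMILES string into atom, bond, branch, and ring-closure tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If joining the tokens does not give back the input, something was skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


if __name__ == "__main__":
    for example in ("CC=O", "[C-]#N", "CC(C)CO"):
        print(example, tokenize(example))
```

Because "Cl" and "Br" are tried before the one-letter symbols, a string such as CCl is read as a carbon followed by a chlorine rather than as two carbons and a stray character.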

Branches are specified by enclosures in parentheses and can be nested or stacked, as shown in these examples.

CC(C)CO                         Isobutyl alcohol (CH3-CH(CH3)-CH2-OH)

CC(CCC(=O)N)CN           5-amino-4-methylpentanamide

Rings are represented by breaking one single or aromatic bond in each ring, and designating this ring-closure point with a digit immediately following the atoms connected through the broken bond.  Atoms in aromatic rings are specified by lowercase letters.  Therefore, cyclohexane and benzene can be represented by the following SMILES.

C1CCCCC1        Cyclohexane (C6H12)

c1ccccc1           Benzene (C6H6)

Although the carbon-carbon bonds in these two SMILES are omitted, it is possible to deduce that the omitted bonds are single bonds (for cyclohexane) and aromatic bonds (for benzene).  One can also represent an aromatic compound as a non-aromatic Kekulé structure.  For example, the following is a valid SMILES string for benzene.

C1=CC=CC=C1         Benzene (C6H6)

Note that aromaticity is not a measurable physical quantity, but a concept without a unanimous mathematical definition.  As a result, different aromaticity detection algorithms often disagree with each other on whether a given molecule is aromatic or not, making it difficult to interchange information between databases that use different aromaticity detection algorithms for SMILES generation.

Also note that a ring structure can have multiple potential ring-closure points.  For example, a six-membered ring has six bonds, each of which can be a ring-closure point.  As a result, a ring compound may be represented by many different but equally valid SMILES strings.  In fact, many SMILES strings usually represent the same structure, whether it has a ring or not, because one can start with any atom in a molecule to derive a SMILES string.  Therefore, it is necessary to select a “unique SMILES” for a molecule among the many possibilities.  Because this is done through a process called “canonicalization”, this unique SMILES string is also called the “canonical SMILES”.
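The rules for implicit hydrogens, branches, ring closures, and aromatic atoms can be sketched as a small parser that turns a SMILES string into atoms and bonds and then derives the molecular formula.  This is a minimal sketch under stated assumptions, not a real SMILES implementation: the names `parse`, `formula`, `VALENCES`, and `BOND_ORDER` are hypothetical, bracket atoms and stereochemistry are not supported, and aromatic atoms are handled by the simplification of reserving one valence for the aromatic system.  It is not a canonicalization algorithm; it only shows that different valid strings for one molecule lead to the same formula.

```python
import re
from collections import Counter

TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|[-=#:]|[()]|\d")
# Allowed valences for the organic subset; the lowest one that fits is used.
VALENCES = {"B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
            "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,)}
# Assumption: an aromatic bond counts as 1 here; see formula() below.
BOND_ORDER = {"-": 1, "=": 2, "#": 3, ":": 1}


def parse(smiles):
    """Return (atoms, bonds): atoms are (element, is_aromatic) pairs and
    bonds are (i, j, order) triples.  Bracket atoms are not supported."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES for this sketch: {smiles!r}")
    atoms, bonds = [], []
    prev = None        # index of the atom that the next atom bonds to
    pending = None     # explicit bond order waiting for its second atom
    stack = []         # atoms to return to when a branch closes
    open_rings = {}    # ring-closure digit -> (atom index, bond order)
    for tok in tokens:
        if tok == "(":
            stack.append(prev)
        elif tok == ")":
            prev = stack.pop()
        elif tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
        elif tok.isdigit():
            if tok in open_rings:      # second occurrence closes the ring
                start, order = open_rings.pop(tok)
                bonds.append((start, prev, pending or order or 1))
            else:                      # first occurrence opens the ring
                open_rings[tok] = (prev, pending)
            pending = None
        else:                          # an atom symbol
            aromatic = tok.islower()
            element = tok.capitalize() if aromatic else tok
            atoms.append((element, aromatic))
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending or 1))
            prev, pending = len(atoms) - 1, None
    return atoms, bonds


def formula(smiles):
    """Molecular formula with implicit hydrogens (C and H first)."""
    atoms, bonds = parse(smiles)
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    counts = Counter()
    for (element, aromatic), bonded in zip(atoms, used):
        if aromatic:
            bonded += 1   # one valence goes into the aromatic system
        valence = next((v for v in VALENCES[element] if v >= bonded), bonded)
        counts[element] += 1
        counts["H"] += valence - bonded
    symbols = [e for e in ("C", "H") if counts[e]] + sorted(
        e for e in counts if e not in ("C", "H") and counts[e])
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in symbols)


if __name__ == "__main__":
    for example in ("C1CCCCC1", "c1ccccc1", "C1=CC=CC=C1", "CC(C)CO"):
        print(example, formula(example))
```

An identical formula is a necessary but not a sufficient condition for two strings to describe the same molecule; establishing that they are truly equivalent requires canonicalization with a cheminformatics toolkit.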

 

Isomeric SMILES

        Isomeric SMILES allows for specifying isotopism and stereochemistry of a molecule.  Information on isotopism is indicated by the integral atomic mass preceding the atomic symbol.  The atomic mass must be specified inside square brackets.  For example, C-13 methane can be represented by “[13CH4]”.  Configuration around double bonds is specified by “directional bonds” (characters / and \).  For example, E- and Z-1,2-difluoroethene can be represented by the following isomeric SMILES:

F/C=C/F or F\C=C\F         (E)-1,2-difluoroethene (trans isomer)

F/C=C\F or F\C=C/F         (Z)-1,2-difluoroethene (cis isomer)

Configuration around tetrahedral centers is indicated by the symbols “@” or “@@”.

C[C@@H](C(=O)O)N        L-Alanine

C[C@H](C(=O)O)N           D-Alanine

More detailed information on chirality specification can be found in Chapter 3 of the Daylight Theory Manual14.
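Whether a SMILES string carries isomeric information can be checked by looking for the markers described in this section.  The following is a minimal sketch; the function name `has_isomeric_info` is hypothetical, and the check is purely textual, so it does not verify that the string is a valid SMILES.

```python
import re


def has_isomeric_info(smiles):
    """Return True if the string contains isotope or stereo markers."""
    # An isotope is an integer right after the opening bracket, e.g. [13CH4].
    has_isotope = re.search(r"\[\d+", smiles) is not None
    # Tetrahedral configuration is written with @ or @@ inside brackets.
    has_chirality = "@" in smiles
    # Double-bond configuration uses the directional bonds / and \.
    has_directional_bond = "/" in smiles or "\\" in smiles
    return has_isotope or has_chirality or has_directional_bond


if __name__ == "__main__":
    for example in ("[13CH4]", "F/C=C/F", "C[C@@H](C(=O)O)N", "CC(C(=O)O)N"):
        print(example, has_isomeric_info(example))
```

Ring-closure digits such as those in C1CCCCC1 are not mistaken for isotopes, because the pattern only matches digits that directly follow an opening square bracket.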

Limitations of SMILES

        SMILES is proprietary and is not an open project.  This has led different chemical software developers to use different SMILES generation algorithms, resulting in different SMILES versions for the same compound.  Therefore, SMILES strings obtained from different databases or research groups are not interchangeable unless the same software was used to generate them.  To address this interchangeability issue, an open-source project was launched to develop an open, standard version of the SMILES language called OpenSMILES16.  However, the most noticeable community effort in this area is the development of InChI, which is described in the next section.

International Chemical Identifier (InChI) and InChIKey

InChI

        The IUPAC International Chemical Identifier (InChI)10-13 was originally developed by IUPAC, and its development is continued by the InChI Trust13.  InChI is non-proprietary, open-source, and freely available to the scientific community.  In particular, because the software for generating InChI strings is also freely available, InChI avoids the interoperability issue that different implementations of the SMILES language have.

        InChI encodes a chemical structure into “layers”.  Each layer holds a distinct and separable class of structural information, with the layers ordered to provide successive structural refinement.  There are currently six InChI layer types, each representing a different class of structural information: the main layer, a charge layer, a stereochemical layer, an isotopic layer, a fixed-H layer, and a reconnected layer.  The main layer, which specifies the chemical formula, the atoms, and the bonds between them, is required for all InChIs.  The other layers appear only when the corresponding input information is provided.  Layers and sublayers start with “/” (forward slash) followed by a letter denoting the identity of the layer (except for the chemical formula layer).  Below are some examples of InChI.

InChI=1S/CH4/h1H4 (methane)

InChI=1S/C2H6/c1-2/h1-2H3 (ethane)

InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 (ethanol)

InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1 (L-alanine)

These InChI strings are not easy for a human to understand (especially compared to SMILES strings).  This is because InChI was developed as a “machine-readable” chemical identifier, with the aim of enabling a computer to regenerate the corresponding chemical structure from the InChI string generated by another computer.  For this reason, InChI is often called the bar code for chemical structures.

        Because the layered structure of InChI allows one to represent a chemical structure with a desired level of detail, InChI software may generate different InChI strings for the same molecule.  This flexibility may be regarded as an obstacle to standardization and interoperability.  In response to this concern, the standard InChI was introduced.  It is generated with standard option settings in the InChI software, so that it always contains the same level of structural detail and follows the same conventions for drawing perception.  Standard InChI strings begin with “InChI=1S/”, while non-standard InChI strings begin with “InChI=1/”.  The digit “1” following “InChI=” is the current InChI version number.
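The layered layout described above can be made concrete by splitting an InChI string at its forward slashes.  This is a minimal sketch: the function name `split_inchi` and the dictionary keys are hypothetical, and the sketch assumes that the first segment after the version is the chemical formula, which holds for the examples in this section but not for every InChI.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula, and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *segments = inchi[len(prefix):].split("/")
    # Assumption: the first segment is the formula (it has no prefix letter).
    formula = segments[0] if segments else ""
    # Every later segment starts with one letter that identifies the layer.
    layers = [(segment[:1], segment[1:]) for segment in segments[1:]]
    return {
        "version": version.rstrip("S"),
        "is_standard": version.endswith("S"),
        "formula": formula,
        "layers": layers,
    }


if __name__ == "__main__":
    l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
    result = split_inchi(l_alanine)
    print(result["formula"], result["is_standard"])
    for letter, content in result["layers"]:
        print(letter, content)
```

The layers are kept as an ordered list of pairs rather than a dictionary because the same prefix letter can occur more than once in one string, for example when a non-standard InChI has a fixed-H layer with its own sublayers.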

InChIKey

        The length of an InChI string increases with the size of the corresponding chemical structure, and molecules with more than 100 atoms commonly result in very long InChI strings, which are not appropriate for use in Internet search engines (such as Google, Yahoo, and Bing).  In addition, these search engines are not case-sensitive and do not handle the special characters used in InChI.  To address this issue, the InChIKey was introduced for Internet and database searching/indexing.  It is a 27-character string derived from InChI, using a hashing algorithm.  Hashing is a one-way mathematical transformation typically used to calculate a compact fixed-length digital representation of a much longer string of arbitrary length.
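The fixed-length property of hashing can be seen with a general-purpose hash function from the Python standard library.  This is only an illustration of the idea: the function name `fixed_length_digest` is hypothetical, and the sketch does not reproduce the encoding and truncation steps that the InChI software uses to derive an actual InChIKey.

```python
import hashlib


def fixed_length_digest(text):
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()


if __name__ == "__main__":
    methane = "InChI=1S/CH4/h1H4"
    l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
    for inchi in (methane, l_alanine):
        digest = fixed_length_digest(inchi)
        # The input lengths differ, but the digest length stays the same.
        print(len(inchi), len(digest), digest)
```

Because the transformation is one-way, the original string cannot be recovered from the digest; a lookup table or database is needed to go from the hashed form back to the structure.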

        The InChIKey consists of three blocks, separated by hyphens, for example:

BSYNRYMUTXBXSQ-UHFFFAOYSA-N (aspirin)

HEFNNWSXXWATRW-UHFFFAOYSA-N (ibuprofen)

RZVAJINKPMORJF-UHFFFAOYSA-N (acetaminophen)

 

The first block of 14 characters (out of 27 characters in total) encodes core molecular constitution, described by the InChI main layer.  The other structural features (such as stereochemistry, isotopic substitution, exact position of hydrogens, and metal ligation data) are encoded into the second block.  The protonation or deprotonation state is encoded in the last InChIKey character.
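The block structure of an InChIKey can be sketched as a simple pattern match.  The names `INCHIKEY_PATTERN`, `split_inchikey`, and `same_connectivity` are hypothetical.  The sketch also assumes, based on the InChI documentation as understood here, that the second block ends with a flag character ("S" for standard, "N" for non-standard) followed by a version character; treat that detail as an assumption rather than as part of the description above.

```python
import re

# 14 letters, hyphen, 8 letters + flag + version letter, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Split a 27-character InChIKey into its blocks."""
    match = INCHIKEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    connectivity, other, flag, version, protonation = match.groups()
    return {
        "connectivity": connectivity,   # first block: main layer
        "other_layers": other,          # stereo, isotopes, fixed H, ...
        "is_standard": flag == "S",     # assumption: S = standard InChI
        "version": version,
        "protonation": protonation,     # last character
    }


def same_connectivity(key_a, key_b):
    """True if two InChIKeys share the same first block."""
    return (split_inchikey(key_a)["connectivity"]
            == split_inchikey(key_b)["connectivity"])


if __name__ == "__main__":
    aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
    ibuprofen = "HEFNNWSXXWATRW-UHFFFAOYSA-N"
    print(split_inchikey(aspirin))
    print(same_connectivity(aspirin, ibuprofen))
```

Comparing only the first block is a common way to group records that share the same connectivity but differ in stereochemistry or isotopic substitution.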

        Many databases such as PubChem17, ChemSpider18, ChEBI19, and NIST Chemistry Webbook20 accept InChI and InChIKey strings as queries to search for chemical structures.  InChIs and InChIKeys can also be used as queries in UniChem21 to produce cross-references between chemical structure identifiers from different databases.

Generic Structures

        A generic structure indicates a group of structurally similar compounds, using a symbol such as “R” (as in R-CH2-OH, where R = H, CH3, CH2CH3, CH(CH3)2, C(CH3)3, and so on).  Generic structures are commonly used in chemistry texts as well as in chemical patents in which the inventor claims a whole class of related compounds.  Generic structures are more often called “Markush” structures after Dr. Eugene A. Markush, who was involved in a legal case that set a precedent in the USA for generic chemical structure patent filing.

        An early example of research projects on Markush structure storage and retrieval is the Sheffield Generic Structures Project, which led to a text-based language for generic structure description called GENSAL (GENeric Structure LAnguage)22 as well as an extended connection table representation for generic structures23.  The Sheffield generic structures system was never implemented commercially, but influenced two commercial systems: MARPAT24 (developed by CAS) and Markush DARC (currently Thomson Reuters’ Merged Markush Service25).

        Some public databases, such as PubChem, allow one to search for generic structures using SMARTS (SMiles ARbitrary Target Specification), a language for describing molecular patterns.  SMARTS is useful for substructure searching, which finds a particular pattern (subgraph) in a molecule.  SMARTS is a straightforward extension of SMILES: all SMILES symbols and properties are legal in SMARTS, and SMARTS adds logical operators and additional molecular descriptors.  Detailed information on SMARTS is given in the SMARTS specification document26 in the Daylight Theory Manual and in the SMARTS tutorial27.

        Another extension of SMILES is SMIRKS28,29, which is a line notation for generic reactions.  A generic reaction represents a group of reactions that undergo the same set of atom and bond changes.  Note that SMILES and SMARTS can be used to represent reactions, using the “>” symbol to separate the reactants, agents, and products, as described in the SMILES and SMARTS specification documents.  (Therefore, SMILES and SMARTS strings that describe reactions are often called reaction SMILES and reaction SMARTS, respectively.)  SMIRKS, on the other hand, is used to represent types of reactions (e.g., the SN2 reaction).  More detailed information on SMIRKS is given in the SMIRKS specification document28 and the SMIRKS tutorial29.
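The way “>” divides a reaction SMILES can be sketched with plain string handling.  This is a minimal sketch that assumes the Daylight layout reactants>agents>products, with “.” separating the individual molecules in each part; the function name `split_reaction_smiles` and the esterification example are illustrative choices made here, not taken from the specification documents.

```python
def split_reaction_smiles(reaction):
    """Split 'reactants>agents>products' into three lists of SMILES."""
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError("expected the form reactants>agents>products")
    # Molecules within one part are separated by "."; empty parts give [].
    reactants, agents, products = (
        [smiles for smiles in part.split(".") if smiles] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


if __name__ == "__main__":
    # Acetic acid + ethanol, acid catalyst, giving ethyl acetate + water.
    esterification = "CC(=O)O.OCC>[H+]>CC(=O)OCC.O"
    print(split_reaction_smiles(esterification))
    # The agent part may be empty, which leaves two adjacent ">" symbols.
    print(split_reaction_smiles("C=C.BrBr>>BrCCBr"))
```

A reaction SMILES written this way describes one specific reaction between specific molecules, which is the contrast with a SMIRKS pattern that describes a whole reaction type.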

The following material was provided by Stephen Heller, Project Director of the InChI Trust.




References

(1)   Wiswesser, W. J. J. Chem. Inf. Comput. Sci. 1982, 22, 88.

(2)   Ash, S.; Cline, M. A.; Homer, R. W.; Hurst, T.; Smith, G. B. J. Chem. Inf. Comput. Sci. 1997, 37, 71.

(3)   Homer, R. W.; Swanson, J.; Jilek, R. J.; Hurst, T.; Clark, R. D. J. Chem. Inf. Model. 2008, 48, 2294.

(4)   Barnard, J. M.; Jochum, C. J.; Welford, S. M. ACS Symposium Series 1989, 400, 76.

(5)   Rohbeck, H. G. In Software Development in Chemistry 5; Gmehling, J., Ed.; Springer Berlin Heidelberg: 1991, p 49.

(6)   Weininger, D. J. Chem. Inf. Comput. Sci. 1988, 28, 31.

(7)   Weininger, D.; Weininger, A.; Weininger, J. L. J. Chem. Inf. Comput. Sci. 1989, 29, 97.

(8)   Weininger, D. J. Chem. Inf. Comput. Sci. 1990, 30, 237.

(9)   SMILES: Simplified Molecular Input Line Entry System (http://www.daylight.com/smiles/) (Accessed on 6/30/2015).

(10)   Heller, S.; McNaught, A.; Stein, S.; Tchekhovskoi, D.; Pletnev, I. J. Cheminform. 2013, 5, 7.

(11)   Heller, S.; McNaught, A.; Pletnev, I.; Stein, S.; Tchekhovskoi, D. J. Cheminform. 2015, 7, 23.

(12)   The IUPAC International Chemical Identifier (InChI) (http://www.iupac.org/home/publications/e-resources/inchi.html) (Accessed on 6/29/2015).

(13)   InChI Trust (http://www.inchi-trust.org/) (Accessed on 6/29/2015).

(14)   Daylight Theory Manual, Chapter 3: SMILES - A Simplified Chemical Language (http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html) (Accessed on 6/23/2015).

(15)   Daylight SMILES Tutorial (http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html) (Accessed on 6/23/2015).

(16)   OpenSMILES Home Page (http://www.opensmiles.org/) (Accessed on 6/23/2015).

(17)   PubChem (https://pubchem.ncbi.nlm.nih.gov) (Accessed on 6/29/2015).

(18)   ChemSpider (http://www.chemspider.com) (Accessed on 6/29/2015).

(19)   ChEBI (https://www.ebi.ac.uk/chebi/) (Accessed on 6/29/2015).

(20)   NIST Chemistry Webbook (http://webbook.nist.gov/chemistry/) (Accessed on 6/29/2015).

(21)   UniChem (https://www.ebi.ac.uk/unichem/) (Accessed on 6/29/2015).

(22)   Barnard, J. M.; Lynch, M. F.; Welford, S. M. J. Chem. Inf. Comput. Sci. 1981, 21, 151.

(23)   Barnard, J. M.; Lynch, M. F.; Welford, S. M. J. Chem. Inf. Comput. Sci. 1982, 22, 160.

(24)   MARPAT (https://www.cas.org/content/markush) (Accessed on 6/30/2015).

(25)   Merged Markush Service (http://ip-science.thomsonreuters.com/support/patents/dwpiref/reftools/classification/markush/) (Accessed on 6/30/2015).

(26)   Daylight Theory Manual, Chapter 4: SMARTS - A Language for Describing Molecular Patterns (http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html) (Accessed on 6/23/2015).

(27)   Daylight SMARTS Tutorial (http://www.daylight.com/dayhtml_tutorials/languages/smarts/index.html) (Accessed on 6/23/2015).

(28)   Daylight Theory Manual, Chapter 5: SMIRKS - A Reaction Transform Language (http://www.daylight.com/dayhtml/doc/theory/theory.smirks.html) (Accessed on 10/8/2015).

(29)   Daylight SMIRKS Tutorial (http://www.daylight.com/dayhtml_tutorials/languages/smirks/index.html) (Accessed on 10/8/2015).

 

Module 5: Identifying Chemical Entities

By Sunghwan Kim

Questions

1. Go to the PubChem database (http://pubchem.ncbi.nlm.nih.gov) and search for omeprazole and esomeprazole.  Fill in the table below with appropriate chemical representations for the two molecules and answer the following questions.

(1) What is the structural difference between omeprazole and esomeprazole?
(2) Do omeprazole and esomeprazole have the same InChI and InChIKeys as each other?
(3) Do omeprazole and esomeprazole have the same canonical SMILES?  Explain why.

Omeprazole (from PubChem)

  CID:
  IUPAC name:
  Canonical SMILES:
  Isomeric SMILES:
  InChI:
  InChIKey:

Esomeprazole (from PubChem)

  CID:
  IUPAC name:
  Canonical SMILES:
  Isomeric SMILES:
  InChI:
  InChIKey:

2. Go to ChemSpider (http://www.chemspider.com) and search for omeprazole and esomeprazole.  Fill in the table below with appropriate chemical representations for the two molecules and answer the following questions.

(1) Are the systematic names from ChemSpider the same as those from PubChem?

(2) Are the canonical SMILES from ChemSpider the same as those from PubChem?

(3) Are the InChI and InChIKeys from ChemSpider the same as those from PubChem?

Omeprazole (from ChemSpider)

  ChemSpider ID:
  IUPAC name:
  Canonical SMILES:
  Isomeric SMILES:
  InChI:
  InChIKey:

Esomeprazole (from ChemSpider)

  ChemSpider ID:
  IUPAC name:
  Canonical SMILES:
  Isomeric SMILES:
  InChI:
  InChIKey:

3. Compare the SMILES strings from PubChem with those from ChemSpider for the following compounds, in terms of how the two databases deal with perceived aromaticity of the molecules.  Explain an advantage and a disadvantage of the SMILES strings used in each database.

  Compound        SMILES from PubChem        SMILES from ChemSpider
  Benzene
  Pyridine
  Pyrrole
  Furan
  Thiophene
  Selenophene
  Tellurophene

4. Suppose that you are a project manager at Google in charge of adding a chemical search algorithm to Google search.  This algorithm accepts a chemical structure as input through the search box on the Google homepage (http://www.google.com), but the input needs to be a text string that represents a chemical structure.  Therefore, you need to choose the line notation that is most appropriate for this search system from among the canonical SMILES, InChI, and InChIKey.  Choose only one and justify your choice over the others, based on what you have learned from this module and from Questions 1, 2, and 3.

Comments 23

OLCC S199 | Fri, 02/17/2017 - 13:43

I wanted to see if the neutral and zwitterion mol files for glycine gave different InChIs. They are not different, but the InChIKeys are.

At the bottom of this page I uploaded the file Sooyah.txt, which has the InChI, InChIKey and mol files for both forms of glycine.

Do you know why the keys are different if the InchIs are the same?

Sunghwan Kim | Fri, 02/17/2017 - 14:30

I don't know how you generated the InChI and InChIKeys for glycine (in both neutral and zwitter ionic forms). Would you please explain it?

Robert Belford | Fri, 02/17/2017 - 14:49

She pasted the mol files from the bottom of the page (Sooyah.txt). One was the neutral, one was the zwitterion. Our logic was that if two files were of the same molecule, but one was neutral and the other was a zwitterion, the InChI should be the same. Which it was, but the keys are different.

Sunghwan Kim | Fri, 02/17/2017 - 15:37

Let me explain why you got the unexpected results. Please follow the directions below to get the InChI and InChIKey strings directly from the NCI Resolver.

1. First use the NCI resolver to get the standard and non-standard InChI's for both neutral and zwitter ionic glycines through these links:

(1a) Standard InChI

Neutral: https://cactus.nci.nih.gov/chemical/structure/NCC(O)=O/stdinchi
Zwitter: https://cactus.nci.nih.gov/chemical/structure/%5BNH3+1%5DCC(=O)%5BO-1%5D/stdinchi

(1b) Non-standard InChI

Neutral: https://cactus.nci.nih.gov/chemical/structure/NCC(O)=O/inchi
Zwitter: https://cactus.nci.nih.gov/chemical/structure/%5BNH3+1%5DCC(=O)%5BO-1%5D/inchi

*** These are what you will get from the links above

(2a) Standard InChi

Neutral: InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)
Zwitter: InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)

(2b) Non-standard InChI

Neutral: InChI=1/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)/f/h4H
Zwitter: InChI=1/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)/f/h3H

*** So the Standard InChIs for the neutral and zwitter ionic glycines are identical to each other, while the non-standard InChI's are different. (Please pay attention to the FixedH layer beginning with "/f").

2. Now use the NCI resolver to get the standard and non-standard InChIKeys through these links.

(1a) Standard InChIKey

Neutral: https://cactus.nci.nih.gov/chemical/structure/NCC(O)=O/stdinchikey
Zwitter: https://cactus.nci.nih.gov/chemical/structure/%5BNH3+1%5DCC(=O)%5BO-1%5D/stdinchikey

(1b) Non-standard InChIKey

Neutral: https://cactus.nci.nih.gov/chemical/structure/NCC(O)=O/inchikey
Zwitter: https://cactus.nci.nih.gov/chemical/structure/%5BNH3+1%5DCC(=O)%5BO-1%5D/inchikey

*** These are what you will get from the links above:

(2a) Standard InChiKey

Neutral: DHMQDGOQFOQNFH-UHFFFAOYSA-N
Zwitter: DHMQDGOQFOQNFH-UHFFFAOYSA-N

(2b) Non-standard InChIKey

Neutral: DHMQDGOQFOQNFH-JLSKMEETNA-N
Zwitter: DHMQDGOQFOQNFH-TULZNQERNA-N

*** The standard InChIKey is the same for both forms of glycine, while the two forms have different non-standard InChIKeys. This is what should be expected from the definition of the standard InChI and InChIKeys.

3. I think you got the InChIs and InChIKeys by pasting the mol files into Hack-a-Mol. Try again with one of the mol files you got. To see what happened when you hit the "Enter" key, please click the "info" link above the 3-D view window. This will show a log of what just happened. Just scroll down to the bottom of this info box and you will see the following two lines at the end:

(If you used the mol file for the neutral form)

FileManager opening url https://cactus.nci.nih.gov/chemical/structure/NCC(O)=O/stdinchi
FileManager opening url https://cactus.nci.nih.gov/chemical/structure/NCC(O)=O/inchikey

(If you used the mol file for the zwitter form)

FileManager opening url https://cactus.nci.nih.gov/chemical/structure/%5BNH3+1%5DCC(=O)%5BO-1%5D/stdinchi
FileManager opening url https://cactus.nci.nih.gov/chemical/structure/%5BNH3+1%5DCC(=O)%5BO-1%5D/inchikey

Pay attention to the last words of these URLs. For some reason, Hack-a-Mol requested "stdinchi" and "inchikey", meaning that it got the standard InChIs but the non-standard InChIKeys. That is the reason why you got the same InChI strings for both glycines, but different InChIKey strings for them.

I think Professor Hanson can tell us more details about this.
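The resolver links used in this reply all follow the same pattern, which can be sketched in Python as below. This is only a sketch: the names `BASE` and `resolver_url` are hypothetical, the URL layout is inferred from the links above rather than from the resolver's documentation, and the code only builds the URL without contacting the server.

```python
from urllib.parse import quote

BASE = "https://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier, representation):
    """Build a resolver URL of the form BASE/<identifier>/<representation>."""
    # Square brackets must be percent-encoded; "+", "(", ")" and "=" are
    # left as they are, matching the links shown in this thread.
    encoded = quote(identifier, safe="+()=")
    return f"{BASE}/{encoded}/{representation}"


if __name__ == "__main__":
    neutral = "NCC(O)=O"
    zwitterion = "[NH3+1]CC(=O)[O-1]"
    for smiles in (neutral, zwitterion):
        for representation in ("stdinchi", "inchi", "stdinchikey", "inchikey"):
            print(resolver_url(smiles, representation))
```

Requesting "stdinchikey" rather than "inchikey" is what keeps the key consistent with a standard InChI obtained through "stdinchi".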

Bob Hanson | Sat, 02/18/2017 - 11:40

see above for my reply -- sorry, pressed the wrong reply button.

Bob Hanson | Sat, 02/18/2017 - 11:39

Hello, S199 (Sorry, that's all I have for your name!)

Just wanted to say thanks for such great observation! Hack-A-Mol and Jmol are both better as a result. This is exactly what it's about in open-source software. Wonderful!

Bob Hanson - St. Olaf College

Robert Belford | Fri, 02/17/2017 - 15:10

Yes, she pasted each mol file back into Hack-a-Mol, to see if the two mol files gave the same InChI, which they do. But they do not give the same InChIKeys. I also checked with Bob's site, and it is the same.

Bob Hanson | Sat, 02/18/2017 - 11:40

The reason this is happening is that hackamol.htm uses

var inchikey = Jmol.evaluateVar(jmol, "show('chemical inchikey')").trim();

instead of

var inchikey = Jmol.evaluateVar(jmol, "show('chemical stdinchikey')").trim();

I've corrected that at https://chemapps.stolaf.edu/jmol/jsmol/hackamol.htm and sent a note to admin to get it changed on this site as well.

Olcc S15 | Mon, 02/20/2017 - 13:30

Hello,
I was trying to find out if Omeprazole and Esomeprazole have the same Isomeric SMILES? Thanks

Sunghwan Kim | Mon, 02/20/2017 - 15:07

Omeprazole is a racemic mixture of (R)- and (S)-form, and its chemical structure has an unspecified stereo center. Esomeprazole is the (S)-enantiomer of omeprazole. Below is the link to the isomeric SMILES of esomeprazole.

https://pubchem.ncbi.nlm.nih.gov/compound/9568614#section=Isomeric-SMILES

Remember that isomeric SMILES contains isotope/stereochemistry specifications of a molecule. The SMILES string of esomeprazole contains the "@" symbol to explicitly show the (S)-configuration. Then, how about omeprazole, whose stereocenter has no explicitly specified configuration (e.g., either R or S)? Because of the absence of explicit chiral configurations, omeprazole does not have additional information to encode in its "isomeric" SMILES. (That is, its canonical SMILES and isomeric SMILES are identical. Therefore, the compound summary page of omeprazole shows canonical smiles only). On the contrary, you have both SMILES for esomeprazole.

olcc s16 | Mon, 02/20/2017 - 14:16

In question 2 that we have to look up information in ChemSpider, I was doing search and ChemSpider only provides 1 SMILES string for each compound. In the assignment, there are canonical and isomeric. So, do we have to identify if that string is isomeric or canonical? I'm a bit confused.

Thanks
Phuc

Sunghwan Kim | Mon, 02/20/2017 - 14:50

Sorry for the confusion. If there is no isomeric smiles or canonical smiles, then indicate it with "N/A".

To distinguish whether a SMILES string is isomeric or not, check whether that SMILES string has any information about the isotopes or stereochemistry of the molecule. (e.g., "@", "@@", "/", "\", ......). If it does have any isotopic/stereochemical information, it is an isomeric SMILES.

Evan Hepler-Smith | Mon, 02/20/2017 - 20:22

Hi Phuc,

For a general discussion of "canonical" representations pertinent to this point, take a look at
"FORMULATING CHEMICAL STRUCTURE DATA" in part 1 of this module: http://olcc.ccce.divched.org/2017OLCCModule2P1

Thanks,
Evan

OLCC S17 | Mon, 02/20/2017 - 17:27

In module 2.3-chemical representations, why is it that Omeprazole and Esomeprazole have the same canonical SMILES? any help please??

Sunghwan Kim | Mon, 02/20/2017 - 20:05

Isotope and stereochemistry information of a molecule is not encoded into its canonical SMILES (but into its isomeric SMILES). Therefore, the canonical SMILES cannot tell you whether the chiral center of a molecule has the (R)- or (S)-configuration.

OLCC S197 | Mon, 02/20/2017 - 22:36

Why is it that ChemSpider did not record the canonical and isomeric SMILES of both omeprazole and esomeprazole? If it did, please tell me how to get them.

Sunghwan Kim | Mon, 02/20/2017 - 23:11

All databases we mentioned in this course are autonomous and independent of each other. So, I can't say why a given database chooses to do what it does. For the purpose of homework, if you don't find canonical SMILES or isomeric SMILES from ChemSpider, then indicate it with "N/A" or "None".

Sunghwan Kim | Mon, 02/20/2017 - 23:23

Some people asked about getting canonical and isomeric SMILES from ChemSpider. Currently, ChemSpider shows only one SMILES for a given molecule, but the table in the homework question has two blanks. So, if you don't find one of the two SMILES, put "N/A" or "None" in the corresponding blank.

olcc s16 | Wed, 03/01/2017 - 18:35

Hi,

I am studying for the exam tomorrow. Dr. Belford expects us to know how to generate a SMILES string, so I am doing some practice. I tried to generate a SMILES for benzoic acid and came up with c1ccccc1(C(=O)O), but ChemSpider gives c1ccc(cc1)C(=O)O and PubChem gives C1=CC=C(C=C1)C(=O)O. Is my SMILES equivalent to the ones from ChemSpider and PubChem? Also, since I can mark any carbon of the benzene ring as carbon 1 and start connecting onward, I can also come up with c1cccc(C(=O)O)c1 for benzoic acid. Is my thinking correct?

Thanks,
Phuc

Ehren Bucholtz | Wed, 03/01/2017 - 19:11

It appears that you are getting correct possible SMILES for benzoic acid. The problem with SMILES is that there are many flavors of SMILES. If I remember correctly, Daylight Chemical Information Systems defined SMILES first, but since it is proprietary, other versions were made that were different. By using different algorithms you can have different SMILES, and applying a Morgan algorithm can result in a more canonical form. It looks like the PubChem form is the canonical form, which is also the Daylight SMILES string. If you put your generated SMILES into the Online SMILES Translator and Structure File Generator at https://cactus.nci.nih.gov/translate/ you can see that your SMILES and the ChemSpider SMILES resolve to the PubChem SMILES. The website appears to use the Daylight 1989 definition. As for having multiple SMILES, it all depends on which atom you choose as the first atom in the molecule, and you number from there. This is why InChI was developed to have a non-proprietary format and algorithm as well as open-source software.

I like that SMILES are much more readable for simple molecules, but when you get to a complex molecule like morphine, neither the SMILES nor the InChI is particularly human readable:

InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3/t10-,11+,13-,16-,17-/m0/s1

Canonical SMILES: CN1CCC23C4C1CC5=C2C(=C(C=C5)O)OC3C(C=C4)O

My take is that if neither is particularly human readable, it is better to have a non-proprietary standard like InChI.

Good luck on the exam!

Robert Belford | Wed, 03/01/2017 - 19:17

Phuc, and all,

This is a very good question, which hits a problem with our education system and how we grade students, in that students become conditioned to questions for which there is one correct answer, and we grade them based on that. Although being able to answer those types of questions is important, there is a limit to the types of knowledge they can assess, and if you ask me, doing science requires more than answering questions for which the answer is known.

I will not answer your question, but suggest you take these different smiles strings to a resource like the NCI chemical identifier resolver, and answer your own question.  :-)

You have actually hit one of the big challenges with teaching a course like this, as in many ways we are trying to teach students how to do science.

Cheers,
Bob

 

Sunghwan Kim | Wed, 03/01/2017 - 19:51

Hi, Phuc,

All SMILES strings mentioned in your comment are equivalent.  As covered in Section 2.3, many equivalent SMILES strings exist for the same molecule, which is the reason why we have canonical SMILES.  However, because SMILES itself is proprietary and not an open project, even the canonical SMILES from one program is not necessarily the same as those generated by other programs.  (That is the reason why we have InChI.)

By the way, there is a quick way to check whether a SMILES string you write is correct or not.  Input that SMILES string into any molecule editor and see how the program interprets it.  For example, you can copy and paste your SMILES into PubChem Sketcher and hit the Enter key (https://pubchem.ncbi.nlm.nih.gov/edit2/index.html?cnt=0).  Then, you will see that your SMILES strings for benzoic acid are interpreted correctly.

Have a good night.
Sunghwan
