The Simplified Molecular-Input Line-Entry System (SMILES) [18-21] is a line notation for describing chemical structures using short ASCII strings. SMILES was developed in the late 1980s and implemented by Daylight Chemical Information Systems (Santa Fe, NM), and it is still widely used today. Detailed information on SMILES can be found in Chapter 3 [26] of the Daylight Theory Manual as well as in the SMILES tutorial [27].
- SMILES Specification Rules
In SMILES, atoms are represented by their atomic symbols. The second letter of a two-character atomic symbol must be entered in lowercase (e.g., Cl not CL; Br not BR). Each non-hydrogen atom is specified independently by its atomic symbol enclosed in square brackets, [ ] (for example, [Au] or [Fe]). Square brackets may be omitted for elements in the “organic subset” (B, C, N, O, P, S, F, Cl, Br, and I) if the proper number of “implicit” hydrogen atoms is assumed. “Explicitly” attached hydrogens and formal charges are always specified inside brackets. A formal charge is represented by one of the symbols + or -. Single, double, triple, and aromatic bonds are represented by the symbols -, =, #, and :, respectively. Single and aromatic bonds may be, and usually are, omitted. Here are some examples of SMILES strings.
C Methane (CH4)
CC Ethane (CH3CH3)
C=C Ethene (CH2=CH2)
C#C Ethyne (HC≡CH)
COC Dimethyl ether (CH3OCH3)
CCO Ethanol (CH3CH2OH)
CC=O Acetaldehyde (CH3-CH=O)
C#N Hydrogen cyanide (HCN)
[C-]#N Cyanide anion
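As a rough illustration of the implicit-hydrogen rule, the following Python sketch counts the hydrogens implied by an unbranched, acyclic SMILES string written with organic-subset atoms only. The function name `implicit_hydrogens` and the valence table are assumptions made for this example (the table lists the commonly quoted "normal" valences); bracket atoms, branches, rings, and aromatic atoms are deliberately not handled.

```python
import re

# Assumed "normal" valences for the organic subset (illustrative table).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

# Two-letter symbols (Cl, Br) must be tried before one-letter symbols.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]")


def implicit_hydrogens(smiles):
    """Return [(symbol, implicit_H_count), ...] for an unbranched, acyclic SMILES."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES for this sketch: {smiles!r}")

    atoms = []      # atomic symbols in order of appearance
    bond_sums = []  # total bond order already used by each atom
    pending = 1     # an omitted bond symbol means a single bond
    for token in tokens:
        if token in BOND_ORDER:
            pending = BOND_ORDER[token]
            continue
        if atoms:
            # Bond the new atom to the previous atom in the chain.
            bond_sums[-1] += pending
            bond_sums.append(pending)
        else:
            bond_sums.append(0)
        atoms.append(token)
        pending = 1

    result = []
    for symbol, used in zip(atoms, bond_sums):
        # Fill up to the smallest normal valence that fits the explicit bonds.
        target = next((v for v in NORMAL_VALENCES[symbol] if v >= used), used)
        result.append((symbol, target - used))
    return result


for example in ["C", "CC", "C=C", "C#C", "COC", "CCO", "CC=O", "C#N"]:
    print(example, implicit_hydrogens(example))
```

For ethanol (CCO), the sketch assigns three, two, and one hydrogens to the three atoms, which matches CH3CH2OH.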
Branches are enclosed in parentheses and can be nested or stacked, as shown in these examples.
CC(C)CO Isobutyl alcohol (CH3-CH(CH3)-CH2-OH)
CC(CCC(=O)N)CN 5-amino-4-methylpentanamide
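One way to see how parentheses encode branching is to read the string with a stack: an opening parenthesis remembers the atom that the branch hangs from, and a closing parenthesis returns to it. The sketch below is a minimal illustration under that reading. The function name `parse_branched` is hypothetical, and the parser accepts only acyclic SMILES written with organic-subset atoms.

```python
import re

TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#()]")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_branched(smiles):
    """Parse an acyclic organic-subset SMILES into (atoms, bonds).

    atoms is a list of atomic symbols; bonds is a list of
    (atom_index, atom_index, bond_order) tuples.
    """
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES for this sketch: {smiles!r}")

    atoms, bonds = [], []
    stack = []       # branch points to return to when a branch closes
    previous = None  # index of the atom the next atom will bond to
    order = 1        # an omitted bond symbol means a single bond
    for token in tokens:
        if token == "(":
            stack.append(previous)
        elif token == ")":
            if not stack:
                raise ValueError("unbalanced parentheses")
            previous = stack.pop()
        elif token in BOND_ORDER:
            order = BOND_ORDER[token]
        else:
            atoms.append(token)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index, order))
            previous = index
            order = 1
    if stack:
        raise ValueError("unbalanced parentheses")
    return atoms, bonds


atoms, bonds = parse_branched("CC(C)CO")
print(atoms)
print(bonds)
```

In isobutyl alcohol, atom 1 (the CH carbon) ends up bonded to atoms 0, 2, and 3, so the methyl group written inside the parentheses is attached to the same carbon as the chain that continues after them.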
Rings are represented by breaking one single or aromatic bond in each ring and designating this ring-closure point with a digit immediately following each of the two atoms connected through the broken bond. Atoms in aromatic rings are specified by lowercase letters. Therefore, cyclohexane and benzene can be represented by the following SMILES.
C1CCCCC1 Cyclohexane (C6H12)
c1ccccc1 Benzene (C6H6)
Although the carbon-carbon bonds in these two SMILES are omitted, it is possible to deduce that the omitted bonds are single bonds (for cyclohexane) and aromatic bonds (for benzene). One can also represent an aromatic compound as a non-aromatic Kekulé structure. For example, the following is a valid SMILES string for benzene.
C1=CC=CC=C1 Benzene (C6H6)
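The ring-closure convention can be sketched in a few lines of Python: the first occurrence of a digit opens a ring at the current atom, and the second occurrence closes it by bonding the current atom back to the one that opened it. The function name `ring_closures` is an assumption of this example. Only single-digit closures, unbranched rings, and organic-subset atoms are handled, and bond symbols are skipped rather than interpreted.

```python
import re

# Uppercase organic-subset atoms, lowercase aromatic atoms, bond symbols, digits.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|[-=#:]|\d")
BOND_SYMBOLS = set("-=#:")


def ring_closures(smiles):
    """Return (atoms, closures); closures lists (i, j) atom-index pairs
    that are bonded through a ring-closure digit."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES for this sketch: {smiles!r}")

    atoms, closures = [], []
    open_rings = {}  # digit -> index of the atom that opened that ring
    for token in tokens:
        if token.isdigit():
            if not atoms:
                raise ValueError("ring-closure digit before any atom")
            current = len(atoms) - 1
            if token in open_rings:
                closures.append((open_rings.pop(token), current))
            else:
                open_rings[token] = current
        elif token in BOND_SYMBOLS:
            continue  # bond orders are not needed to find ring closures
        else:
            atoms.append(token)
    if open_rings:
        raise ValueError("unclosed ring")
    return atoms, closures


for example in ["C1CCCCC1", "c1ccccc1", "C1=CC=CC=C1"]:
    print(example, ring_closures(example))
```

All three strings yield six atoms and a single closure between the first and last atom, which is the bond that was "broken" to write the ring as a chain.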
Note that aromaticity is not a measurable physical quantity, but a concept without a unanimous mathematical definition. As a result, different aromaticity detection algorithms often disagree with each other on whether a given molecule is aromatic or not, making it difficult to interchange information between databases that use different aromaticity detection algorithms for SMILES generation.
Also note that a ring structure has multiple potential ring-closure points. For example, a six-membered ring has six bonds, any of which can serve as the ring-closure point. As a result, a ring compound may be represented by many different but equally valid SMILES strings. In fact, most structures, cyclic or not, have many valid SMILES strings, because one can start from any atom in the molecule when deriving the string. It is therefore necessary to select one “unique SMILES” for a molecule among the many possibilities. Because this selection is done through a process called “canonicalization”, the unique SMILES string is also called the “canonical SMILES”.
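The idea can be illustrated with a toy example that writes a SMILES string from every possible starting atom of a small acyclic molecule and then picks one of them by a fixed rule. This is only a sketch of the concept and not the canonicalization algorithm used by Daylight or by any cheminformatics toolkit, which are considerably more involved. The function names, the graph representation (a list of symbols plus a list of index pairs), and the "shortest, then alphabetical" selection rule are all assumptions made for this example, which supports only single bonds and no rings.

```python
def smiles_from(atoms, bonds, start):
    """Write a SMILES string for an acyclic, singly bonded graph from `start`."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def visit(atom, parent):
        # Sort the substituent strings so the result does not depend on
        # the order in which atoms and bonds were listed.
        parts = sorted(
            (visit(n, atom) for n in neighbors[atom] if n != parent),
            key=lambda s: (len(s), s),
        )
        # All substituents except the last are written as branches.
        branches = "".join(f"({part})" for part in parts[:-1])
        main_chain = parts[-1] if parts else ""
        return atoms[atom] + branches + main_chain

    return visit(start, None)


def all_smiles(atoms, bonds):
    """Distinct SMILES strings obtained by starting from each atom in turn."""
    return sorted({smiles_from(atoms, bonds, s) for s in range(len(atoms))})


def toy_canonical(atoms, bonds):
    """Pick one string by a fixed rule: shortest first, then alphabetical."""
    return min(all_smiles(atoms, bonds), key=lambda s: (len(s), s))


ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
print(all_smiles(ethanol_atoms, ethanol_bonds))     # several valid strings
print(toy_canonical(ethanol_atoms, ethanol_bonds))  # one string chosen by rule
```

For ethanol the sketch produces three valid strings (CCO, OCC, and C(C)O) and selects CCO. Listing the same molecule with its atoms in a different order gives the same selected string, which is the property that a canonical SMILES is meant to provide.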
Isomeric SMILES can specify the isotopic composition and stereochemistry of a molecule. An isotope is indicated by the integral atomic mass preceding the atomic symbol, and the atomic mass must be specified inside square brackets. For example, C-13 methane can be represented by “[13CH4]”. Configuration around double bonds is specified by “directional bonds” (the characters / and \). For example, (E)- and (Z)-1,2-difluoroethene can be represented by the following isomeric SMILES:
F/C=C/F or F\C=C\F (E)-1,2-difluoroethene (trans isomer)
F/C=C\F or F\C=C/F (Z)-1,2-difluoroethene (cis isomer)
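The pattern in these examples can be sketched as a small check: when the two directional bonds point the same way, the substituents lie on opposite sides of the double bond (trans), and when they differ, the substituents lie on the same side (cis). The sketch below covers only strings of the simple form X/C=C/Y, with one substituent written before the double bond and one after it. The function name `double_bond_geometry` and the regular expression are assumptions of this example. Translating cis/trans into E/Z in general requires priority rules that are not modelled here; for 1,2-difluoroethene, trans corresponds to E and cis to Z.

```python
import re

# Matches only the simple form X/C=C/Y, where each slash may be / or \.
SIMPLE_ALKENE = re.compile(
    r"(?P<left>[A-Za-z]+)(?P<d1>[/\\])C=C(?P<d2>[/\\])(?P<right>[A-Za-z]+)"
)


def double_bond_geometry(smiles):
    """Return 'trans' or 'cis' for a SMILES of the form X/C=C/Y."""
    match = SIMPLE_ALKENE.fullmatch(smiles)
    if match is None:
        raise ValueError(f"unsupported SMILES for this sketch: {smiles!r}")
    # Same direction on both sides: opposite faces (trans).
    # Different directions: same face (cis).
    return "trans" if match["d1"] == match["d2"] else "cis"


# Raw strings keep the backslash characters literal.
for example in [r"F/C=C/F", r"F\C=C\F", r"F/C=C\F", r"F\C=C/F"]:
    print(example, double_bond_geometry(example))
```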
Configuration around tetrahedral centers is indicated by the symbols “@” or “@@”:
C[C@@H](C(=O)O)N L-Alanine
C[C@H](C(=O)O)N D-Alanine
More detailed information on chirality specification can be found in Chapter 3 [26] of the Daylight Theory Manual.
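The pieces of information that can appear inside square brackets (isotope, atomic symbol, chirality mark, explicit hydrogen count, and formal charge) can be pulled apart with a regular expression. The following is a minimal sketch. The pattern, the function name `parse_bracket_atom`, and the dictionary keys are assumptions made for this example, and less common forms such as repeated charge signs (++) or two-letter aromatic symbols are not covered.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"           # optional integral atomic mass
    r"(?P<symbol>[A-Z][a-z]?|[a-z])" # atomic symbol (lowercase = aromatic)
    r"(?P<chirality>@@?)?"           # optional @ or @@
    r"(?:H(?P<hcount>\d*))?"         # optional explicit hydrogens
    r"(?P<charge>[+-]\d*)?\]"        # optional formal charge
)


def parse_bracket_atom(text):
    """Split one bracket atom such as '[13CH4]' into its components."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported bracket atom for this sketch: {text!r}")

    hcount = match["hcount"]  # None if no H is written, '' if H has no digit
    charge = match["charge"]
    if charge is None:
        charge_value = 0
    else:
        magnitude = int(charge[1:]) if len(charge) > 1 else 1
        charge_value = magnitude if charge[0] == "+" else -magnitude

    return {
        "isotope": int(match["isotope"]) if match["isotope"] else None,
        "symbol": match["symbol"],
        "chirality": match["chirality"],
        "hydrogens": 0 if hcount is None else int(hcount or 1),
        "charge": charge_value,
    }


for example in ["[13CH4]", "[C@@H]", "[C@H]", "[C-]", "[Au]"]:
    print(example, parse_bracket_atom(example))
```

Applied to the examples in this section, the sketch reports an isotope of 13 and four hydrogens for [13CH4], the chirality mark and one hydrogen for the alanine stereocenters, and a charge of -1 for the carbon atom of the cyanide anion.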
SMILES is proprietary and is not an open project. As a result, different chemical software developers use different SMILES generation algorithms, which produce different SMILES strings for the same compound. SMILES strings obtained from different databases or research groups are therefore not interchangeable unless they were generated with the same software. To address this interchangeability issue, an open-source project was launched to develop an open, standard version of the SMILES language called OpenSMILES [28]. However, the most notable community effort in this area is the development of InChI, which is described in the next section.