2.2.2. Anatomy of a MOL file

Most connection table formats contain one or more of the following:

  • A list of atoms, specifying the elemental identity of each atom
  • A list of bonds, specifying the atoms that each bond connects and the bond multiplicity (single, double, triple)
  • 2D or 3D spatial coordinates for each atom (sometimes measured, sometimes calculated; often, it’s not clear which)
  • Counts of the number of atoms and bonds in the molecule
  • Attributes associated with atoms or bonds (e.g. R/S configuration of a stereocenter; dashed/wedged bonds)
  • Attributes associated with an entire structure (e.g. net charge)

 

The MOL file, a widely-used chemical structure file format, contains all of these.

Here is a MOL file for benzoic acid, generated by ChemDraw, which provides options to save or to copy sketches in this file format.

MOL file for benzoic acid as generated from ChemDraw

The following figures illustrate the anatomy of a MOL file (MOL v2000, to be specific): the counts line, the atoms block, the bonds block, and the properties block.

Counts Line:

Counts Line of MOL file

Atoms Block:

MOL file Atoms Block for Benzoic Acid

Bonds Block:

MOL file bonds block for benzoic acid

Properties Block:

MOL file properties block for benzoic acid

Note that benzoic acid has nine atoms and nine bonds, not counting hydrogens. If all six hydrogens were included explicitly in this connection table, there would be six more entries in each of the atoms and bonds blocks, and the counts line would show fifteen atoms and fifteen bonds.
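The block structure described above can also be read programmatically. The following is a minimal Python sketch, assuming the fixed-width V2000 layout given in the CTFile documentation (three header lines, then the counts line, atoms block, bonds block, and properties block). The function names `build_molblock` and `parse_molblock`, the atom numbering, and the placeholder coordinates are all hypothetical; this is not the ChemDraw file shown in the figures.

```python
# Heavy atoms of benzoic acid in a hypothetical numbering:
# 1-6 ring carbons, 7 carboxyl carbon, 8 carbonyl O, 9 hydroxyl O.
SYMBOLS = ["C"] * 7 + ["O", "O"]
BONDS = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2), (6, 1, 1),
         (1, 7, 1), (7, 8, 2), (7, 9, 1)]


def build_molblock(symbols, bonds):
    """Write a V2000 molblock; coordinates are placeholders, not real geometry."""
    lines = ["benzoic acid", "  hypothetical example", ""]  # three header lines
    lines.append(f"{len(symbols):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    for i, symbol in enumerate(symbols):
        lines.append(f"{float(i):10.4f}{0.0:10.4f}{0.0:10.4f} {symbol:<3} 0" + "  0" * 11)
    for atom1, atom2, bond_type in bonds:
        lines.append(f"{atom1:3d}{atom2:3d}{bond_type:3d}  0")
    lines.append("M  END")
    return "\n".join(lines)


def parse_molblock(text):
    """Split a V2000 molblock into atoms, bonds, and property lines."""
    lines = text.splitlines()
    counts = lines[3]                                   # counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:                   # atoms block
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:  # bonds block
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    rest = lines[4 + n_atoms + n_bonds:]                # properties block
    properties = [l for l in rest if l.startswith("M  ") and l != "M  END"]
    return atoms, bonds, properties


atoms, bonds, properties = parse_molblock(build_molblock(SYMBOLS, BONDS))
print(len(atoms), len(bonds))  # 9 9
```

Reading fields by column position rather than by splitting on whitespace is a deliberate choice here: in the fixed-width layout, adjacent three-character fields can run together once atom numbers reach three digits.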

 

Multiple molecules

A connection table can represent multiple distinct compounds. Take a look at MOL II, phthalic acid.

MOL file for phthalic acid.  This will be compared to the phthalic anhydride and water in the next image

We can represent the stoichiometrically equivalent phthalic anhydride plus water by keeping the same atom block and changing a couple of entries in the bonds block. Now, we have one connection table (MOL III) representing two molecules. (Connection tables can also be used in representing reactions. For more on this, see online documentation on MOL and related file formats.)

MOL file of phthalic anhydride and water

Let's compare the bond tables of the above two files:

Comparing bond table (block) of phthalic acid and phthalic anhydride/water
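One way to see that a single connection table holds two molecules is to count the connected groups of atoms in the bond table. The sketch below does this with a small union-find routine. The atom numbering is hypothetical (1-6 ring carbons, 7 and 8 carboxyl carbons, 9-12 oxygens) and does not reproduce MOL II or MOL III, so the number of bond entries that change may differ from the figure.

```python
def count_molecules(n_atoms, bonds):
    """Count connected components (distinct molecules) in a bond table."""
    parent = list(range(n_atoms + 1))       # index 0 unused; atoms are 1-based

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for atom1, atom2, _bond_type in bonds:
        parent[find(atom1)] = find(atom2)   # merge the two groups
    return len({find(i) for i in range(1, n_atoms + 1)})


RING = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2), (6, 1, 1)]
SUBSTITUENTS = [(1, 7, 1), (2, 8, 1), (7, 9, 2), (7, 10, 1), (8, 11, 2)]

# Phthalic acid: oxygen 12 is the hydroxyl oxygen on carbon 8.
ACID = RING + SUBSTITUENTS + [(8, 12, 1)]
# Anhydride plus water: oxygen 10 now bridges carbons 7 and 8,
# and oxygen 12 has no bonds to heavy atoms (it is the water oxygen).
ANHYDRIDE_WATER = RING + SUBSTITUENTS + [(8, 10, 1)]

print(count_molecules(12, ACID))             # 1
print(count_molecules(12, ANHYDRIDE_WATER))  # 2
```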

 

Tricky features

Working with connection tables can become tricky when it comes to features of chemical identity that are not directly represented as a static collection of atoms and covalent bonds, such as:

  • Aromaticity and delocalization
  • Tautomerism
  • Coordination

Sometimes these phenomena are not (or even cannot be) represented in the connection table at all. Other times, different file formats (or different users of the same file format) will adopt different conventions for indicating them. This can make things tricky for those who want to manipulate chemical structure data across the tens of millions of known chemical compounds (and the limitless space of possible compounds). However, it also means that there are ample opportunities for developing clever cheminformatic solutions to the limitations of connection tables.

Few of these issues are likely to be solved completely. Think of the following examples, and the exercises that follow, as training in the sort of questions that you would be prudent to ask when it comes to working with digital data about chemical structures.

 

Aromaticity

Structural formulas I, IV, and V all represent the same molecule: benzoic acid. However, remember: connection tables typically correspond to structural formulas on an atom-by-atom, bond-by-bond level, not on a holistic level. Since these are three different patterns of atoms and bonds, they correspond to three different MOL files. Each of the two Kekulé structures for the benzene ring shows up as a different set of single and double bonds (MOL I, MOL IV).

Atom table for three different ways of drawing benzoic acid

The Bond Tables are different:

Three different bond tables for benzoic acid

The MOL file format uses the number 4 to indicate bonds that are explicitly labeled as aromatic (MOL V). This has the advantage of differentiating aromatic bonds from single and double bonds without requiring the chemist to write a script to identify and label the alternating single and double bonds of a Kekulé structure. However, some software may not be built to handle this convention. (You might even run into cases in which it’s interpreted as a quadruple bond!)
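To make the point concrete, here is a small Python sketch that builds three bond tables for benzoic acid (two Kekulé forms and one using bond type 4) and checks that they share the same connectivity while differing in bond types. The atom numbering and helper names are hypothetical and are not taken from MOL I, IV, or V.

```python
BOND_TYPE_NAMES = {1: "single", 2: "double", 3: "triple", 4: "aromatic"}

RING_PAIRS = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
SIDE_CHAIN = [(1, 7, 1), (7, 8, 2), (7, 9, 1)]   # carboxyl group on ring atom 1


def with_ring_types(bond_types):
    """Bond table for benzoic acid with the given six ring bond types."""
    ring = [(a, b, t) for (a, b), t in zip(RING_PAIRS, bond_types)]
    return ring + SIDE_CHAIN


def skeleton(bonds):
    """Connectivity only: which atom pairs are bonded, ignoring bond type."""
    return {frozenset((a, b)) for a, b, _bond_type in bonds}


def aromatic_bonds(bonds):
    """Atom pairs whose bond type field is 4."""
    return [(a, b) for a, b, bond_type in bonds if bond_type == 4]


KEKULE_A = with_ring_types([2, 1, 2, 1, 2, 1])
KEKULE_B = with_ring_types([1, 2, 1, 2, 1, 2])
AROMATIC = with_ring_types([4] * 6)

print(skeleton(KEKULE_A) == skeleton(KEKULE_B) == skeleton(AROMATIC))  # True
print(KEKULE_A == KEKULE_B)                                            # False
print(len(aromatic_bonds(AROMATIC)))                                   # 6
```

A comparison like `skeleton` only shows that the same atoms are connected; deciding that two bond tables describe the same compound takes more than this, which is part of what makes aromaticity a tricky feature.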

 

Conjugate acids and bases

Two structural formulas may represent the same compound under different conditions (e.g. a conjugate acid and base). Again, keep in mind that even though these structural formulas may refer to the same compound, they will be represented by different connection tables. You may need to choose one or the other of these connection tables / structural representations, or both, depending on your aims and the conventions of the database that you’re using. (V, VI)

MOL file for acid/base conjugate pair

 

Resonance

Run-of-the-mill delocalization presents some of the same problems as aromaticity, but there is no conventional label for (non-aromatic) delocalized electrons, such as the delocalized negative charge and pi system in benzoate (VII and VIII). The connection tables will simply represent one resonance structure or another.

MOL file showing resonance structure

 

Tautomerism

Connection tables don’t link together tautomers in a straightforward way. You may need to work with multiple connection tables to account for different tautomers or to make sure that you have the most appropriate one for your purposes (IX, X).

Tautomers represent a challenge for MOL files

 

Chirality

MOL files do indicate chirality. However, they can do so in two ways. A “1” or “6” in the fourth field of the bonds table indicates wedged and dashed bonds, respectively. A “1” or “2” in the stereochemistry field of the atom table represents the chirality of a stereocenter. (To make things even more complicated, software may account for the chirality of a stereocenter atom when generating a MOL file but ignore it when rendering a MOL file!) (XI, XII)
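The two stereo fields can be decoded with simple lookups. The sketch below assumes the V2000 column positions and code values from the CTFile documentation, in which the atom field records a parity (odd or even) rather than an R/S label. The example lines are generated by the code itself and are hypothetical.

```python
# Code values as listed in the CTFile (V2000) documentation.
BOND_STEREO = {0: "none", 1: "wedged (up)", 4: "either", 6: "dashed (down)"}
ATOM_PARITY = {0: "none", 1: "odd", 2: "even", 3: "either or unmarked"}


def bond_stereo(bond_line):
    """Decode the fourth field (columns 10-12) of a bonds-block line."""
    return BOND_STEREO.get(int(bond_line[9:12]), "unknown")


def atom_parity(atom_line):
    """Decode the stereo parity field (columns 40-42) of an atoms-block line."""
    return ATOM_PARITY.get(int(atom_line[39:42]), "unknown")


# Hypothetical example lines, built with format strings to keep columns aligned.
bond_line = f"{2:3d}{5:3d}{1:3d}{1:3d}"   # atoms 2-5, single bond, stereo code 1
atom_line = f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} {'C':<3}{0:2d}{0:3d}{1:3d}"

print(bond_stereo(bond_line))   # wedged (up)
print(atom_parity(atom_line))   # odd
```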

MOL file showing chirality

 

Hack-a-Mol

Here’s a website at St. Olaf College where you can play with the relationship between 2D structures, 3D renderings, identifiers, and connection tables, courtesy of the cheminformatician Bob Hanson. There’s a link on the page to a document explaining “How it Works” (also linked here). As this course proceeds you will learn how we communicate with the NCI resolver and PubChem, and many of the fundamental features behind this application.

We have also embedded Hack-a-Mol below, and when doing your assignments you may want to open it in a new window.




Let’s take another look at benzoic acid. Clear the 2D sketch window using the white box button at the top, second from the left, and then draw benzoic acid. Click the right arrow button. That should render a 3D structure in the window to the right and generate a MOL file in the text window below. (For details on where this data comes from, see the “2D to 3D” and “3D to structure data” sections in “How it Works.”)

Now, take a look at the MOL file in the text window. You will note that, as a default, Hack-a-Mol includes explicit H in the MOL files it generates. (See discussion of explicit and implicit H earlier in this module for more information.)

Identify the atoms and bonds that make up the ring. (These will vary depending on the way that you drew the molecule – the 2D sketch application numbers atoms and bonds in the order that they are drawn.) Remember, the first two columns in each bond table entry refer to rows in the atom table, and the third column gives the bond type (1=single, 2=double, etc.) connecting these two atoms. (You can check yourself by hovering over atoms in the 3D window or clicking the “labels” link above this window.)

Once you have identified the six ring bonds in the MOL file, manually adjust them to generate the other Kekulé structure of the ring. (That is, switch the 1’s for 2’s and the 2’s for 1’s in the bond type fields (third column) of the bond table entries for the six ring bonds.) With the cursor still in the text window, press enter. This should generate the other Kekulé structure for benzoic acid in both the 3D and 2D windows.
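The same swap can be written as a few lines of Python. This is a sketch under simplifying assumptions: the bond table is a list of (atom, atom, bond type) tuples in a hypothetical numbering, and a bond counts as a ring bond when both of its atoms are in the ring, which is adequate for a single benzene ring but not for fused ring systems.

```python
def swap_kekule(bonds, ring_atoms):
    """Exchange single and double bonds among the given ring atoms."""
    ring = set(ring_atoms)
    swapped_type = {1: 2, 2: 1}
    result = []
    for atom1, atom2, bond_type in bonds:
        if atom1 in ring and atom2 in ring:
            bond_type = swapped_type.get(bond_type, bond_type)
        result.append((atom1, atom2, bond_type))
    return result


# Benzoic acid heavy atoms: 1-6 ring, 7 carboxyl C, 8 and 9 oxygens.
BONDS = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2), (6, 1, 1),
         (1, 7, 1), (7, 8, 2), (7, 9, 1)]

other_kekule = swap_kekule(BONDS, range(1, 7))
print([bond_type for _, _, bond_type in other_kekule[:6]])  # [1, 2, 1, 2, 1, 2]
```

Bonds outside the ring, including the C=O double bond, are left alone, and applying the function twice returns the original bond table.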

Just for kicks, let’s generate a nonsense structure. Change all of the ring bonds to double bonds, and press enter. You should now have a chemically-offensive structure involving a cyclohexahexene ring with six positively charged carbon atoms violating valence rules. There’s a lesson here – software won’t tell you that your structure data is chemically nonsensical unless it is programmed to do so.
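As an illustration of what "programmed to do so" might look like, here is a deliberately simple valence check in Python. It is a sketch, not how any particular package works: the `MAX_VALENCE` table covers only a few neutral elements, formal charges are ignored, and bond type 4 (aromatic) is not handled. The atom numbering is hypothetical, with explicit hydrogens as atoms 10-15.

```python
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "Cl": 1}  # simplified, neutral atoms only


def overvalent_atoms(symbols, bonds):
    """Return 1-based numbers of atoms whose summed bond orders exceed the table."""
    total = [0] * (len(symbols) + 1)
    for atom1, atom2, bond_type in bonds:
        total[atom1] += bond_type
        total[atom2] += bond_type
    return [i for i, symbol in enumerate(symbols, start=1)
            if symbol in MAX_VALENCE and total[i] > MAX_VALENCE[symbol]]


# Benzoic acid with explicit H: 1-6 ring C, 7 carboxyl C, 8 =O, 9 -OH, 10-15 H.
SYMBOLS = ["C"] * 7 + ["O", "O"] + ["H"] * 6
RING = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2), (6, 1, 1)]
OTHERS = [(1, 7, 1), (7, 8, 2), (7, 9, 1), (2, 10, 1), (3, 11, 1),
          (4, 12, 1), (5, 13, 1), (6, 14, 1), (9, 15, 1)]

all_double = [(a, b, 2) for a, b, _ in RING]

print(overvalent_atoms(SYMBOLS, RING + OTHERS))        # []
print(overvalent_atoms(SYMBOLS, all_double + OTHERS))  # [1, 2, 3, 4, 5, 6]
```

With every ring bond set to double, each ring carbon carries five bonds' worth of bond order, which is the valence violation described above.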

Revert to benzoic acid, either by changing the bonds back manually or just by clearing the 2D sketch window, re-drawing, and clicking the right arrow button again.

Now, let’s stick a chlorine atom onto the benzene ring. Using the atom and bond tables, locate the atom table entry for a ring hydrogen ortho, meta, or para to the carboxyl group (your pick!). Change the atom symbol in this atom table entry from H to Cl, and press enter. You should now have the chlorobenzoic acid isomer of your choice in both 3D and 2D windows.

One more exercise: let’s make our benzoic acid into pyridine-3-carboxylic acid – that is, benzoic acid with N in place of one of the ring carbons meta to the carboxylic group. This is the compound better known as niacin (vitamin B3).

(Tangential fun fact: niacin, discovered as an acidic reaction product of nicotine, was originally named nicotinic acid. In the 1930s, it was found to be the essential nutrient that prevented pellagra, a devastating disorder widely prevalent in the American South in the early twentieth century. Public health officials promoted enriching flour with nicotinic acid, and the epidemic of pellagra began to disappear. However, physicians and scientists worried that the name “nicotinic acid” gave the impression that they were curing mass disease by putting tobacco into bread. A National Research Council committee decided to change the name of the substance to niacin, short for nicotinic acid vitamin.)

Anyway: locate the entry for a ring carbon meta to the carboxyl group. (Hint: 1) use the atom and bond tables to identify the carbon atom bonded to the two oxygen atoms; 2) find the ring carbon bonded to that carboxyl carbon; 3) find a ring carbon two bonds away from that carboxyl-substituted ring carbon.) Change that carbon to N, and press enter.

Now we have the N atom in our ring, but you will notice that it’s positively charged. We didn’t change any of the explicit hydrogens, so the N atom remains protonated, like the C atom that it replaced. Let’s get rid of that hydrogen atom. Locate the entry for the N-H bond in the bond table and the entry for the corresponding H atom in the atom table, and delete both of them. Press enter.

Unless you were very lucky, you should now have a monstrous mess in the 3D window and nothing at all in the 2D window. Uh-oh. Go back to the MOL file window, press ctrl-Z twice to undo the deletion of those rows, and press enter. That will take you back to N-protonated niacin.

By deleting a row of the atom table, we renumbered all of the subsequent atom table entries. Since we didn’t change the atom references in the bond table, this broke all of the bonds to these renumbered atoms.

Once again, delete that N-H bond from the bond table and the entry for that H atom in the atom table. However, now fix the bond table references by **decreasing the atom number by 1** for all atoms below the row that you deleted. (That is, if the hydrogen that you deleted was the 13th atom table entry, change each 14 in the first two columns of the bond table to a 13, and change each 15 in the first two columns of the bond table to a 14.)

Hit enter. Ugh – your structure is probably screwed up **again**, even if you did all of this renumbering correctly. You may even have lost your ring, for some reason.

Take a look at the counts line of the MOL file – the row above the atom table, just below the file headers. The first two numbers in this line refer to the number of atoms and bonds in the molecule. Since we deleted an atom and a bond, we need to decrease each of these from 15 to 14. Do so, and then press enter again. You should now have niacin.
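The three manual steps in this walkthrough (delete the atom and its bond, renumber the later atoms in the bond table, update the counts line) can be collected into one function. The following is a minimal sketch; the function name `delete_atom`, the tuple-based bond table, and the atom numbering are all hypothetical, and coordinates are left out for brevity.

```python
def delete_atom(symbols, bonds, index):
    """Remove the 1-based atom `index`, drop its bonds, renumber the rest.

    Returns the new symbols, the new bonds, and the (atom, bond) counts
    that belong on the counts line.
    """
    new_symbols = symbols[:index - 1] + symbols[index:]

    def renumber(n):
        return n - 1 if n > index else n    # atoms after the deleted row shift up

    new_bonds = [(renumber(a), renumber(b), bond_type)
                 for a, b, bond_type in bonds
                 if index not in (a, b)]     # bonds to the deleted atom disappear
    return new_symbols, new_bonds, (len(new_symbols), len(new_bonds))


# Benzoic acid with explicit H: 1-6 ring C, 7 carboxyl C, 8 =O, 9 -OH,
# 10-14 H on ring atoms 2-6, 15 H on the hydroxyl O.
symbols = ["C"] * 7 + ["O", "O"] + ["H"] * 6
bonds = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2), (6, 1, 1),
         (1, 7, 1), (7, 8, 2), (7, 9, 1), (2, 10, 1), (3, 11, 1),
         (4, 12, 1), (5, 13, 1), (6, 14, 1), (9, 15, 1)]

symbols[2] = "N"                             # ring atom 3 is meta to the carboxyl group
symbols, bonds, counts = delete_atom(symbols, bonds, 11)   # atom 11 is the H on atom 3
print(counts)                                # (14, 14)
```

Deriving the counts from the lists, rather than editing them by hand, avoids the mismatch between the counts line and the tables that breaks the structure in the manual version.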

Whew. Thank goodness that connection table handling is so amenable to automation!

Play around some more with Hack-a-Mol. Take a look at the “How it Works” page – a lot of the notations, apps, and processes referred to on this page will be covered in subsequent weeks. You may find it useful to continue to come back to this page and play around with it as you move on in this course.

Exercises

1. Does Hack-a-Mol handle the number 4 for an aromatic bond? How can you tell? Can you create a chemically sound but non-aromatic structure using 4s in the bond field?

2. Perfluorooctanoic acid (PFOA) is a surfactant that played a key role for a long time in the manufacture of fluorinated polymers including Teflon. Over the past decade, it has been the subject of significant public health concern and a whole bunch of litigation.

Pull PFOA into Hack-a-Mol by typing it into the text search box below the 3D window and clicking “search.”

2a. Edit the MOL file to defluorinate PFOA, converting it into octanoic acid.

2b. Now make it into acetic acid. (It is possible to do this in a way that yields correct-looking 2D and 3D renderings without changing any XYZ coordinates, but you have to be ***very*** careful about how you delete and relabel atoms and bonds.)

Further Reading

  1. https://en.wikipedia.org/wiki/Chemical_table_file
  2. CTFile Formats, June 2005, Elsevier/MDL, https://web.archive.org/web/20070630061308/http://www.mdl.com/downloads/public/ctfile/ctfile.pdf (Documentation for v2000 MOL file and related chemical table file formats.)
  3. Hack-a-Mol: https://chemapps.stolaf.edu/jmol/jsmol/hackamol.htm
    (Documentation: https://chemapps.stolaf.edu/jmol/docs/misc/hackamolworkings.pdf)

 

 

 


Comments 10

Olcc S15 | Wed, 02/08/2017 - 14:53

When using tricky features such as aromaticity, is it a given that MOL V will always represent a ring structure, no matter the atom and bond level, as opposed to the Kekulé structures represented as MOL I and MOL IV?

Evan Hepler-Smith | Wed, 02/08/2017 - 15:18

The short answer: no.

The long answer:

For each of MOL I, IV, and V, the bond table contains a collection of entries that make a ring (atom 1 to atom 2, 2 to 3, 3 to 4, 4 to 5, 5 to 6, and 6 back to 1, closing the circuit). Without bond table entries that form a closed chain in this way, you won't end up with a ring.

So what happens if you enter "4" (indicating an aromatic bond) in the bond type field for a structure that can't possibly be aromatic, such as a straight chain? Exercise 1 at the end of this section asks you to investigate how Hack-a-mol deals with such situations.

It's important to remember that there is not an absolute "right" answer to this question. There are just the answers that happen to have been programmed into a particular parsing algorithm or software for rendering connection tables as (graphical) structural formulas. You can imagine a few possibilities: a program might return an error (the thinking: "That's not aromatic! Bad file!"). Alternatively, a program might display conjugated double bonds (the thinking: "Okay, that's not aromatic, but whatever program created this file must have just meant "conjugated" instead of aromatic. And aromaticity is not an especially precise concept, anyway.") You can probably imagine other choices that a cheminformatics programmer might make.

(These sorts of ambiguities are among the reasons why some databases default to Kekule structures. Even though they can be confusing, they avoid possible ambiguities in how to correlate connection table data with specific chemical structures.)

Anyway - try messing around with Hack-a-mol and see what you can discover about how it treats those aromatic "4s" when applied to bonds that can't possibly be aromatic.

Evan

Olcc S15 | Wed, 02/08/2017 - 15:35

Thank you, Evan. I realized that was based on comparing with the Kekulé structures, not just all "conjugated". I will try with the benzene ring to see a comparison.

Daniel

olcc s16 | Tue, 02/14/2017 - 14:13

At the bottom of this page, there are 3 MOL files for glycine. glycine.txt is the file Hack-a-Mol obtained from PubChem. In glycineZi, we tried to get the zwitterion to work, but the nitrogen bonded 5 times. In glycineZixyz, we changed the x coordinate for atom 10, but we cannot show the negative charge on the oxygen. Can you upload glycineZixyz to Hack-a-Mol and tell us what we need to do to make it look right?

Bob Belford | Tue, 02/14/2017 - 14:39

Just to clarify the above question, I had asked the class to change the file for Glycine from its neutral form to the Zwitterion, which ends up being a bit more complicated than just changing the atom the carboxylic acid bonds to.

Evan Hepler-Smith | Tue, 02/14/2017 - 16:43

Let's see...

It looks like the 3D structure for glycineZixyz is more or less correct, but the negative charge on the carboxylate oxygen is missing in the 2D structure, correct?

In order to explain what I **think** is going on here, we'll have to take a look at "How it Works" that Dr. Hanson put together for Hack-a-mol. Here's the link (also linked at the bottom of the Hack-a-mol frame): https://chemapps.stolaf.edu/jmol/docs/misc/hackamolworkings.pdf

The relevant passage:

"When the user modifies the structure (or pastes into the textarea some sort of structure file data) and presses ENTER, [..details re. script...] the loadMol() method is executed. This method
checks to see if the data is 2D mol data (the characters “2D” starting at column 21 on the second line of the file) or 3D data (anything else), and then passes the data either to the appropriate module."

Okay, no "2D" in the specified place, so that's where the 3D structure comes from. How do we get from 3D to 2D? We go to that section of How it Works:

"We need to maintain both a 3D and a 2D representation when a 3D model is loaded. In order to do
that, we again tap into the CIR at NCI. This is done in the to2D() function [...details re. script...] which then sends the following command to the CIR, again using a SMILES string to communicate"

Hack-a-mol also displays this SMILES string: "SMILES: [O]C(=O)C[NH3] at ChEMBL"

Note that we've got the extra proton on the N, but there's no charge on the O. My guess is that this is part of the reason why you're getting the glycinium cation rather than the zwitterion.

We could try directly inputting a SMILES string for the zwitterion [O-]C(=O)C[NH3+] into the text box. Same problem: we get the right 3D, but we're still missing the negative charge in the 2D. In fact, the same thing happens if we plug in, say, the SMILES for acetate CC([O-])=O . Same with azide [N-]=[N+]=[N-] . The negative ions just don't want to show up in the 2D.

Since the path, as specified by How It Works, is mol file --> 3D --> 2D AND identifier --> 3D --> 2D, and since the 3D --> 2D conversion goes through a SMILES string, it seems reasonable to guess that something funky is going on with how the 3D renderer is handling the SMILES for anions.

We'll have to get Dr. Hanson to weigh in on this puzzle!

Thanks,
Evan

Bob Hanson | Wed, 02/15/2017 - 18:58

Ah, that's interesting! The problem with the JSME app (2D, on the left) not recognizing the negative charge appears to be due to the fact that when Jmol requests the JME format from CIR, which it needs to send to JSME, the call is this:

https://cactus.nci.nih.gov/chemical/structure/%5BNH3+1%5DCC(=O)%5BO-1%5D/file?format=jme

and the return is this:

8 7 N 0.537 1 H 0 0.69 H 0.227 1.54 H 0.847 0.463 C 1.4 1.5 C 2.27 1 O 2.27 0 O 3.14 1.5 1 2 1 1 3 1 1 4 1 1 5 1 5 6 1 6 7 2 6 8 1

Eh? you say?

8 atoms
7 bonds
[2D coordinates follow]
N 0.537 1
H 0 0.69
H 0.227 1.54
H 0.847 0.463
C 1.4 1.5
C 2.27 1
O 2.27 0
O 3.14 1.5
[Bonds follow, referencing atoms above]
1 2 1
1 3 1
1 4 1
1 5 1
5 6 1
6 7 2
6 8 1

But, wait -- no mention of charge!? The N has three Hs attached, so that can be figured out. But I'm afraid we are out of luck for the oxygen. JSME apparently doesn't know that we have specified ALL the H atoms, and it just supplies the missing one.

It's a limitation.

I will check with the author of JSME (Bruno Bienfait) and see what he has to say; maybe I am just missing some sort of flag that will set this right....

....OK, what I have learned is that NCI has a bug in that it is returning an incorrect JME string when formal charges are present. The proper return would be something like this:

7 6 N+ 11.31 -5.97 C 9.86 -5.13 C 8.40 -5.97 O- 6.96 -5.13 O 8.40 -7.64 H 12.20 -5.45 H 11.31 -7.00 1 2 1 2 3 1 3 4 1 3 5 2 1 6 1 1 7 1

(Note the "+" and "-" signs there.)

Alas, that is the only server I know of that can do this, so I think we are out of luck on fixing that. It's a limitation.

Bob Hanson

Otis Rothenberger | Wed, 02/15/2017 - 15:21

Bob,

I'm almost certain it's a Resolver issue. If I extract the JME file for the glycine zwitterion from JME, I get the following with charge designation on oxygen:

5 4 O- 13.15 -5.14 C 11.70 -5.97 O 11.70 -7.64 C 10.24 -5.14 N+ 8.80 -5.97 1 2 1 2 3 2 2 4 1 4 5 1

JME file on Resolver goes way back to Markus. I think this is a case of Resolver not keeping up with changes in the JME file structure. By the way, the above loads correctly in Jmol.

Otis
