Module 4b: Assignment

2.4. Exercises (Module 4b)

2.4.1. Exercise 1

NOTE: You may want to download the MS Word document at the top of the module homepage.

Fill in the following table using both ChemSpider and PubChem, and indicate where the answers differ. Use the new compound identity search in PubChem (click on the hexagon icon to enter a structural formula): https://pubchem.ncbi.nlm.nih.gov/search/#collection=compounds&query_type=structure&query_subtype=identity

| Molecular formula | Structural formula | Systematic name | SMILES | InChI | CAS RN |
| | | | CC(=C)C=C | | |
| | | Ammonium acetate | | | |
| C6H6O | | | | | |
| | | | | | 105-53-3 |
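
For readers who prefer to check their table entries programmatically, the lookup can be sketched against PubChem's PUG REST interface. This is a minimal sketch and not part of the original assignment: the function names `pubchem_property_url` and `fetch_properties` are made up for illustration, the choice of properties is an assumption, and the fetch step needs network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(identifier, namespace="smiles",
                         properties=("MolecularFormula", "IUPACName", "InChI")):
    """Build a PUG REST URL that asks PubChem for a few compound properties.

    `namespace` says how to read `identifier` (for example "smiles" or "name").
    """
    # Percent-encode the identifier so characters such as '=', '(' and ')'
    # are passed through safely inside the URL path.
    encoded = quote(identifier, safe="")
    return (f"{PUG_REST}/compound/{namespace}/{encoded}"
            f"/property/{','.join(properties)}/JSON")


def fetch_properties(identifier, namespace="smiles", timeout=10):
    """Fetch the property records for one identifier (requires network access)."""
    with urlopen(pubchem_property_url(identifier, namespace), timeout=timeout) as resp:
        data = json.load(resp)
    return data["PropertyTable"]["Properties"]


if __name__ == "__main__":
    # Only the URLs are printed here; call fetch_properties() to query PubChem.
    print(pubchem_property_url("CC(=C)C=C"))
    print(pubchem_property_url("ammonium acetate", namespace="name"))
```

Comparing what comes back with what ChemSpider shows for the same query is still a manual step, and that comparison is the point of the exercise.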

2.4.2. Exercise 2

Which of the above form(s) of notation is/are preferable for:

  a) Ordering a specific compound from a supplier

 

 

  b) Identifying an important compound in a journal article

 

 

 

  c) Sorting a list of compounds into chemically meaningful groups

 

 

 

2.4.3. Exercise 3

Search PubChem for lactic acid (racemic) and its two enantiomers, shown in Module 4a, Figure III.a-e and in section 2.2.2 above.

  a) Paste the URLs, IUPAC names, and CAS numbers that you find below.

 

  b) You should see something strange in your answer to part a). What is it?

 

 

  c) Luckily, the makers of PubChem have followed some of the best practices that we’ve outlined above. What have they done that can help you get to the bottom of the inconsistency that you’ve discovered in parts a) and b)?

     

  d) Follow the clue that PubChem left you as to the source of the discrepancy and describe what might have happened with the data coming into PubChem. (If you aren’t sure what to do, see http://www.drugbank.ca/drugs/DB03066 and http://www.drugbank.ca/drugs/DB04398.)

Comments 20

Alex Williams (not verified) | Tue, 09/29/2015 - 00:30
Are there any alternative word processors that have good chemistry fluency? It seems like it would be a popular product, considering how unintuitive the interfaces in Word are for representing molecular formulas.

Jordi Cuadros | Wed, 09/30/2015 - 16:11
You may find it interesting to check the Chemistry Add-in for Word: <a href="https://chem4word.codeplex.com/">https://chem4word.codeplex.com/</a>. I haven't used it myself, but it does look promising.

John House (not verified) | Tue, 10/06/2015 - 17:30
PubChem does not appear to display CAS numbers for some compounds, such as 1-aminoethane-1,2-diol. Is there a reason for this (such as private access) and how can one go about retrieving a CAS number?

Alex Williams (not verified) | Tue, 10/06/2015 - 20:26
You can do so through the CACTUS website; however, as stated, some CAS numbers are not public knowledge and therefore will not be on CACTUS. <a href="http://cactus.nci.nih.gov/chemical/structure">http://cactus.nci.nih.gov/chemical/structure</a>

Leah Rae McEwen | Wed, 10/07/2015 - 17:51
As stated, CAS Registry Numbers are proprietary. Many are available out in the world, although very few of these are authoritative. Most have been copied from other places, including most of those available through CACTUS. In addition to respecting intellectual property, two things are particularly relevant in the context of using this information:

1- CAS RNs as numeric strings do not contain any encoded information about the chemical entities in these registry records, none. If the number is not associated with additional information about the compound, such as structure or name, there is no way to know anything about the substance.

2- There is not necessarily one CAS RN per discrete compound or defined substance, so it is not a reliable 'identifier' in this way, although it has been forced into that role to meet a community need for a robust identifier (more discussion in the next Module).

There is NOT one CAS RN per compound for several reasons, including:

A. The criteria by which CAS assigns chemical entities to records and IDs are not public knowledge and have been primarily driven by what the industry needs to register and what characterization is published in the literature (articles, patents, etc.).

B. As more characterization data becomes available for many compounds, overlap occurs between some records, and some RNs are determined to be obsolete and deleted. CAS keeps track of these so a CAS database user can find information on any former CAS RN that may have been published.

C. How databases, chemists, and informatics applications define chemical entities varies, and thus the map of CAS RN to entity will vary depending on which you are using and where the CAS RN data came from.

Authoritative sources of CAS RNs include:

1. CAS databases, which can be searched (at some cost) via SciFinder and STN, <a href="http://www.cas.org/">www.cas.org/</a>

2. Common Chemistry (from CAS), just under 8k CAS numbers, openly searchable although the associated structures are image only, <a href="http://www.commonchemistry.org/">http://www.commonchemistry.org/</a>

3. Wikipedia, which carries a version of Common Chemistry by arrangement; look for the green check marks in the ChemBoxes for authoritative numbers
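
One caveat to point 1, added here as an illustration rather than as part of the original comment: the final digit of a CAS RN is a checksum over the preceding digits. It says nothing about the chemistry, which is consistent with the point above, but it does let you catch transcription errors in numbers copied from non-authoritative sources. A minimal sketch, where the function name `cas_check_digit_ok` is made up for this example:

```python
import re


def cas_check_digit_ok(cas_rn: str) -> bool:
    """Return True if the string is shaped like a CAS RN and its check digit matches.

    This only validates the arithmetic; it cannot tell you whether the number
    was ever assigned, or to which substance.
    """
    # A CAS RN has the shape NNNNNNN-NN-N: 2 to 7 digits, 2 digits, 1 check digit.
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas_rn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check


if __name__ == "__main__":
    for candidate in ("105-53-3", "105-53-4"):
        print(candidate, cas_check_digit_ok(candidate))
```

For 105-53-3 from Exercise 1, the weighted sum is 3×1 + 5×2 + 5×3 + 0×4 + 1×5 = 33, and 33 mod 10 = 3 matches the final digit.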

Leah Rae McEwen | Wed, 10/07/2015 - 16:51
Hi everyone, looking up info on 1-aminoethane-1,2-diol was a bit of a trick question to prompt some discussion back to the 'big picture' issues mentioned in the text. The record for 1-aminoethane-1,2-diol has been deprecated from ChemSpider (not sure why); the note is here: <a href="http://www.chemspider.com/Chemical-Structure.10672474.html">http://www.chemspider.com/Chemical-Structure.10672474.html</a>. However, you can still see the skeleton record from a deposit of ChemSpider records into PubChem back in 2007. The PubChem Compound record for 1-aminoethane-1,2-diol lists various related Substance records, corresponding to the original records deposited by various sources into PubChem. PubChem then maps these to standardized compound records. The 2007 ChemSpider record for 1-aminoethane-1,2-diol is included in the standardized PubChem Compound record for 1-aminoethane-1,2-diol.

This example illustrates several points of interest:

A. Databases change all the time, and there is no knowing when a compound was or wasn't there and how it was identified, except by...

B. Provenance trails such as that captured by PubChem in this case, which provide enough information to reconstruct some information about the substance as it was described in the originating source, where this was, and when it was there.

C. If anyone referred to this compound in a paper by the ChemSpider ID (as is often done with IDs from the CAS database), it would not now be findable by an interested reader except through the PubChem mapping system, and perhaps any other database that pulled in ChemSpider data. [Note: the CACTUS lookup service does pull the ChemSpider record, via PubChem.]

D. Chemical Abstracts Service (CAS) assigned a Registry Number ID to 1-aminoethane-1,2-diol in connection with a 2014 WO patent; CAS Registry Numbers are proprietary, and it is highly unlikely that this one has been reproduced out in the world outside of the CAS Registry Database.

Otis Rothenberger | Wed, 10/07/2015 - 17:59
Hi All, This is just a fun fact about the NIH/CADD Resolver that turns out to be useful on occasion. It's also a fun fact that reminds us that IUPAC is not a "name." It's a true connectivity identifier, just like InChI or SMILES. OPSIN sits at the front door of Resolver to determine if a query is an IUPAC. If it's an IUPAC and the query is not specifically for look-up, then Resolver's structural algorithm takes over. So, try all of these on Resolver (or OPSIN):

1-aminoethane-1,2-diol
2-aminoethane-1,2-diol
1-aminoethane-2,1-diol
2-aminoethane-2,1-diol

"Search" for InChI or IUPAC. While only one is standard IUPAC notation, all are equally correct from a connectivity point of view, and that's all OPSIN needs. Now Google any of these names! Google and IUPAC sat down to talk years ago (2006). These discussions are a cheminformatics classic and well worth a view: <a href="https://www.youtube.com/watch?v=mpZj4b9elYE">https://www.youtube.com/watch?v=mpZj4b9elYE</a> Otis
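
The experiment suggested above can also be scripted. The sketch below assumes the Resolver's URL pattern of `/chemical/structure/<identifier>/<representation>` with `stdinchi` as the representation; the helper names `resolver_url` and `resolve` are made up for illustration, and `resolve` needs network access. What the service returns for each name is left for you to observe.

```python
from urllib.parse import quote
from urllib.request import urlopen

RESOLVER = "https://cactus.nci.nih.gov/chemical/structure"

# The four name variants from the comment above.
NAMES = [
    "1-aminoethane-1,2-diol",
    "2-aminoethane-1,2-diol",
    "1-aminoethane-2,1-diol",
    "2-aminoethane-2,1-diol",
]


def resolver_url(identifier, representation="stdinchi"):
    """Build a Resolver URL that converts `identifier` to `representation`."""
    # Percent-encode the identifier so commas and other punctuation are safe in the path.
    return f"{RESOLVER}/{quote(identifier, safe='')}/{representation}"


def resolve(identifier, representation="stdinchi", timeout=10):
    """Ask the Resolver for one representation (requires network access)."""
    with urlopen(resolver_url(identifier, representation), timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


if __name__ == "__main__":
    # Only the URLs are printed here; swap in resolve(name) to query the service.
    for name in NAMES:
        print(resolver_url(name))
```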

Brandon Davis (not verified) | Thu, 10/08/2015 - 16:46
What are some possible reasons that ChemSpider would remove an entry from its database?

Otis Rothenberger | Fri, 10/09/2015 - 19:45
Hi Brandon, One obvious reason would be that a structure for a published compound has been disproven. That’s not the case here, however. Although 1-aminoethane-1,2-diol appears as a published synthetic reagent, I’m guessing that this is based on its generation from other reagents in the reaction mix. Like gem diols, carbinolamines are usually intermediates, not isolable compounds. To be sure, intermediates do appear in databases. For example, benzyne is listed in PubChem and ChemSpider. Again, I’m guessing, but it may be that the actual intermediacy of 1-aminoethane-1,2-diol in the mechanisms of some of these reactions is being called into question. I think “ChemSpiderman” is on the faculty list :-) He could answer this question with a lot less guessing than I’m doing! Best, Otis

Leah Rae McEwen | Fri, 10/09/2015 - 20:41
Dr. David Sharpe from RSC ChemSpider provided some comments on this issue:

[LRM] Why was 1-aminoethane-1,2-diol deprecated from ChemSpider (<a href="http://www.chemspider.com/Chemical-Structure.10672474.html">http://www.chemspider.com/Chemical-Structure.10672474.html</a>)?

[DaSh] In this case the record is deprecated because there are no supporting data sources, that is, no other resources that support that this compound has been made/isolated/measured. This has two consequences: the record might actually have been found to be erroneous information, and, more importantly, there is very little information of value (one algorithmically generated name, a structure, MW, and MF). Most searches that would get you to this record would require you to already know most of the information that is displayed, which would defeat the point of searching. I think an interesting contrast is to imagine chemical databases from the complete opposite perspective. ChemSpider creates records for structures when a data source provides us with the structure and some other facts, but what would it be like to use a database that was created by enumerating structures within chemical space? For instance, Jean-Louis Reymond has enumerated 166 billion chemical structures; one could take that as a starting point and add facts to the records when a data source provided a fact matching that structure. If one then added all of the data currently in the ChemCats collection of the CAS database (currently 105.9 million chemical structures), you would have a database where more than 99 percent of the records consist of only algorithmically generated data. That isn’t to say such a database would be bad, but it might be difficult to find the reported compounds amongst the virtual ones.

[LRM] How can one learn more about the deprecation of this record?

[DaSh] As seen from the link, the record doesn’t really disappear; if cited appropriately, a URL or even a CSID (which can be extrapolated to a URL) will get you back to the data. For the last 5 years or so we have ensured that when a record is deprecated there must be a stated reason, the timestamp of when the deprecation occurred, and information on how to question the deprecation, e.g. <a href="http://www.chemspider.com/Chemical-Structure.262628.html">http://www.chemspider.com/Chemical-Structure.262628.html</a>. 1-aminoethane-1,2-diol was deprecated prior to this practice. As to why we remove the record from the search results, this is largely because a number of the records are completely erroneous, e.g. <a href="http://www.chemspider.com/Chemical-Structure.19826653.html">http://www.chemspider.com/Chemical-Structure.19826653.html</a>, which is most likely the result of OCR software run on patent data misinterpreting a table as a chemical structure. Antony Williams [original ChemSpider developer] has talked frequently about how bad data in one database can be ingested into another database, and as this propagation continues all of the databases start to agree and appear to corroborate the bad data. We believe that bad or very incomplete data needs to be made less accessible to help break that cycle.

Further considerations about the scope of general database services

[DaSh] In addition, I think that 1-aminoethane-1,2-diol raises another very interesting question: with chemical structures it is possible to have data that conforms to all of the data model constraints of a database/index but might not meet other criteria, e.g. chemical stability. I would preface that the points I’m about to make are often dependent on subjective opinions or on the scope of a database. From the perspective of a synthetic chemist, this structure might be considered ‘mythical’ or erroneous, as it would be considered very unstable; depending on the conditions, it would probably collapse to an imine, losing water, or to an aldehyde, losing ammonia. Certainly, I would tend to expect that such a structure might be considered a reactive intermediate. On the other hand, there are very good cases where you might want to capture that the structure is reported in a paper (possibly as part of a reaction mechanism, or detected in a flash photolysis experiment, or as a structure that has been considered in computational modelling). (I don’t have access to the journal article linked to this: <a href="http://www.nature.com/nchem/journal/v4/n11/compound/nchem.1467_comp23.html">http://www.nature.com/nchem/journal/v4/n11/compound/nchem.1467_comp23.html</a>, but I would guess that in this case the structure is shown in a mechanism.) This poses an interesting challenge for a general database such as ChemSpider. If we were only a database of commercially available compounds, it would be clear that this structure would be out of our scope. Instead, we have users who will interpret an entry for the compound in different ways: the searcher who thinks in terms of only isolable compounds will either consider that the structure is in error because they believe it to be too unstable or, conversely, may assume that because the entry is in the database it can be made. Others, who see the database as indexing all chemical information, will be much more open to the idea that this only means that the structure was mentioned in passing in a paper or other resource.

[LRM] The consideration of how a user might interpret the information in a resource ties back into the lecture text for Module 4. It is prudent to assume that every person or system that has generated and further processed compound data (or any data) has different criteria for using and interpreting the structure and associated information about a compound. Aggregator databases such as ChemSpider and PubChem that provide 'history' or provenance trails to the original sources, as we have seen, give users the option to make more informed decisions about how to interpret the records they find, based on their own criteria for their purpose.

Alex Williams (not verified) | Tue, 10/06/2015 - 20:23
How far along is digital integration of lab notebooks? I would assume that this would help a lot considering we are now dealing with computers for nearly every aspect of the experiment process.

Dr. Briney | Wed, 10/07/2015 - 08:55
e-Lab Notebooks are becoming more common but are far from ubiquitous. I can think of a handful of universities that even offer them as regular software (e.g. my alma mater UW-Madison now offers all researchers LabArchives), but that doesn't mean that every lab on those campuses uses them. Industry is a bit farther ahead in terms of e-lab notebook adoption (pharma research being an early adopter). What's unfortunately more common is individual labs/researchers adopting one-off solutions; this can help in the short term, but you NEED to make sure that those records are still available in the long term. As far as functionality, it's spotty in chemistry. General e-lab notebooks often fall flat on the features that chemists want, such as structure drawing and structure searching. More relevant to this class may be a digital way to capture the details about how you run code (IPython Notebook is an example), combined with good version control software (like Git/GitHub). We're at a transition point in this area, so expect adoption to increase going forward. There are still many options for digital notebook technologies available, and we're only starting to see some clear winners and losers in this area. That means that until things are settled, your notebook could be at risk in the event that a company folds. So the short answer is yes, you can get a digital lab notebook, but you'll want to find ways to minimize the risk of potentially losing all of your notes going forward. And as always, make sure your research advisor is on board before you jump into a new technology.

Leah Rae McEwen | Wed, 10/07/2015 - 18:03
We've been looking into chemistry-oriented electronic lab notebooks (ELNs) at Cornell and are now pleasantly encouraged by what we are seeing available for academic situations, compared to even a couple of years ago. In addition to support for chemical structure and substructure searching (and indexing) within the notebook, we've also been interested in:

- ability to import chemically intelligent structure files from external drawing software (via molfiles)
- stoichiometric calculation support
- ability to import and view characterization data such as spectra files (and TLC plates, which are treated as images)
- ability to export customized text-based outputs (for publication)
- ability to export experimental metadata and data files in machine-readable form (to be interoperable with other systems, including potential for data repositories)
- digital signatures (most notebooks that we've looked at meet legal requirements)
- access, security and backup considerations

Some chemistry-aware ELNs being marketed for academic labs:
~ these are currently all cloud-based, with pricing per user per month or year

Mbook by Mestrelab, <a href="http://mestrelab.com/software/mbook/">http://mestrelab.com/software/mbook/</a>
~ same company that provides Mnova, which is used by many, many labs for spectra analysis; the two systems link directly
~ very nice data handling for chemists and for general concerns

One-Click Chem ELN from Dotmatics, <a href="http://www.dotmatics.com/one-click-eln/">http://www.dotmatics.com/one-click-eln/</a>
~ this company formerly supported the pharma industry, so the data handling is excellent
~ one-click option streamlines workflow and pricing for academics; this is being trialled on various campuses
~ also has One-Click Bio ELN in development for cross-disciplinary labs that need a variety of functionality

eNovalys Book, <a href="http://www.enovalys.com/book">http://www.enovalys.com/book</a>
~ quite detailed for total synthesis type chemistry, data and metadata fully exportable

Elements from Perkin-Elmer, <a href="http://go.perkinelmer.com/Q315-ACSFall-Elements-LP">http://go.perkinelmer.com/Q315-ACSFall-Elements-LP</a>
~ from the makers of ChemDraw, very user-friendly for chemists, a bit lightweight with data handling and getting it back out (HTML)
~ however, they are considering a more robust research version for academic labs based on their successful industry enterprise system
~ review on ChemJobber: <a href="http://chemjobber.blogspot.com/2014/10/product-review-perkin-elmers-elements.html">http://chemjobber.blogspot.com/2014/10/product-review-perkin-elmers-elements.html</a>

Brandon Davis (not verified) | Wed, 10/07/2015 - 13:12
Did anyone else have trouble finding this compound on ChemSpider? It was easy to find on PubChem, but I couldn't find it on ChemSpider through any kind of search.

Evan Hepler-Smith | Wed, 10/07/2015 - 14:45
Every chemical structure database is different - you'll reasonably often find compounds that show up in one but not another. Of course, in real life, if you go looking for a compound that isn't showing up as you expect, there's a chance that you're looking for the wrong compound. If one of these compounds isn't showing up in PubChem or ChemSpider, you might think about what you might do next if you were actually sent looking for this compound. An optional extra exercise: can you use these DBs to find closely related compounds, just in case by some miscommunication you were given the wrong structure to start with?

Brian Murphy | Fri, 10/09/2015 - 19:32
Is there a good way to pull molecular structures for a long list of SMILES or InChIs instead of looking them up one at a time?

Leah Rae McEwen | Fri, 10/09/2015 - 20:00
There are several options available through CACTUS: <a href="http://cactus.nci.nih.gov/">http://cactus.nci.nih.gov/</a>, including a SMILES translator and the lookup service (structures returned are display only), among others. OpenBabel: The Open Source Chemistry Toolbox (<a href="http://openbabel.org/wiki/Main_Page">http://openbabel.org/wiki/Main_Page</a>) provides many advanced tools for conversion. It is also possible to pull multiple structures from InChIs in PubChem using a two-step process:

1- First use the Identifier Exchange to get CIDs from InChIs: <a href="https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi">https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi</a>

2- Then use the Download Service to request structures from multiple CIDs: <a href="https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi">https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi</a>
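
The second step can also be scripted. The sketch below does not use the Download Service web form named above; it swaps in PubChem's PUG REST interface, which accepts a comma-separated list of CIDs in the URL. The helper names `chunked` and `sdf_urls` and the batch size of 100 are assumptions made for illustration, and the CIDs in the example are placeholders rather than results from the Identifier Exchange.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sdf_urls(cids, batch_size=100):
    """Build one SDF download URL per batch of CIDs.

    Batching keeps each URL short; 100 per request is an arbitrary choice
    for this sketch, not a documented limit.
    """
    urls = []
    for batch in chunked(list(cids), batch_size):
        cid_list = ",".join(str(cid) for cid in batch)
        urls.append(f"{PUG_REST}/compound/cid/{cid_list}/SDF")
    return urls


if __name__ == "__main__":
    # Placeholder CIDs; in practice these would come from step 1.
    for url in sdf_urls([1, 2, 3, 4, 5], batch_size=2):
        print(url)
```

Each URL can then be fetched with any HTTP client and the responses concatenated into a single SDF file.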

John House (not verified) | Sat, 10/10/2015 - 10:12
For question 2.4.3, question b, is the question referring to part a)? And if so, all that I could possibly assume would be the (R) stereoisomer having two CAS numbers.

Evan Hepler-Smith | Sat, 10/10/2015 - 18:07
Oops - you're correct, part (b) should refer back to part (a). Thanks for catching that typo. And, yes, you've got it - showing two CAS numbers for one stereoisomer is pretty strange. We've discussed elsewhere in the Q&A how this can happen - going through parts (c) and (d) of this question will give you a concrete example of how this can arise in practice.
