Discussion | DivCHED CCCE: Cheminformatics OLCC

I am confused on how a

I am confused about how a computer program can pick up on the information from the picture labeled D, separating various groups into nodal structures, but not read a regular structure in the same way. What purpose does separating the functional groups really serve?

PubChem search

The link in the assignment to PubChem is to a beta search interface. Do you recommend using this beta search? What are the advantages and disadvantages of the beta vs. the standard search box?

Deprecation of records from ChemSpider

Dr. David Sharpe from RSC ChemSpider provided some comments on this issue:
[LRM] Why was 1-aminoethane-1,2-diol deprecated from ChemSpider (<a href="http://www.chemspider.com/Chemical-Structure.10672474.html">http://www.chemspider.com/Chemical-Structure.10672474.html</a>)?
[DaSh] In this case the record is deprecated because there are no supporting data sources, that is, no other resources that support that this compound has been made, isolated, or measured. This has two consequences: the record might actually have been found to be erroneous, and, more importantly, there is very little information of value (one algorithmically generated name, a structure, MW, and MF). Most searches that would get you to this record would require you to know most of the displayed information already, which would defeat the point of searching. I think an interesting contrast is to imagine chemical databases from the complete opposite perspective. ChemSpider creates records for structures when a data source provides us with the structure and some other facts, but what would it be like to use a database that was created by enumerating structures within chemical space? For instance, Jean-Louis Reymond has enumerated 166 billion chemical structures; one could take that as a starting point and add facts to the records when a data source provided a fact matching that structure. If one then added all of the data currently in the ChemCats collection of the CAS database (currently 105.9 million chemical structures), you would have a database where more than 99 percent of the records consist of only algorithmically generated data. That isn’t to say such a database would be bad, but it might be difficult to find the reported compounds among the virtual ones.
[LRM] How can one learn more about the deprecation of this record?
[DaSh] As seen from the link, the record doesn’t really disappear; if cited appropriately, a URL or even a CSID (which can be extrapolated to a URL) will get you back to the data. For the last 5 years or so we have ensured that when a record is deprecated there must be a stated reason, a timestamp of when the deprecation occurred, and information on how to question the deprecation, e.g. <a href="http://www.chemspider.com/Chemical-Structure.262628.html">http://www.chemspider.com/Chemical-Structure.262628.html</a>. 1-aminoethane-1,2-diol was deprecated prior to this practice. As to why we remove the record from the search results, this is largely because a number of the records are completely erroneous, e.g. <a href="http://www.chemspider.com/Chemical-Structure.19826653.html">http://www.chemspider.com/Chemical-Structure.19826653.html</a>, which is most likely the result of OCR software run on patent data misinterpreting a table as a chemical structure. Antony Williams [original ChemSpider developer] has talked frequently about how bad data in one database can be ingested into another database, and as this propagation continues all of the databases start to agree and appear to corroborate the bad data. We believe that bad or very incomplete data needs to be made less accessible to help break that cycle.
Further considerations about the scope of general database services
[DaSh] In addition, I think that 1-aminoethane-1,2-diol raises another very interesting question: with chemical structures it is possible to have data that conforms to all of the data model constraints of a database/index but might not meet other criteria, e.g. chemical stability.
I would preface that the points I’m about to make often depend on subjective opinions or on the scope of a database. From the perspective of a synthetic chemist, this structure might be considered ‘mythical’ or erroneous because it would be considered very unstable; depending on the conditions, it would probably collapse to an imine by losing water or to an aldehyde by losing ammonia. Certainly, I would tend to expect that such a structure might be considered a reactive intermediate. On the other hand, there are very good cases where you might want to capture that the structure is reported in a paper (possibly as part of a reaction mechanism, detected in a flash photolysis experiment, or as a structure that has been considered in computational modelling). (I don’t have access to the journal article linked to this: <a href="http://www.nature.com/nchem/journal/v4/n11/compound/nchem.1467_comp23.html">http://www.nature.com/nchem/journal/v4/n11/compound/nchem.1467_comp23.html</a> but would guess that in this case the structure is shown in a mechanism.) This poses an interesting challenge for a general database such as ChemSpider; if we were only a database of commercially available compounds, it would be clear that this structure would be out of our scope. Instead, we have users who will interpret an entry for the compound in different ways: the searcher who thinks only in terms of isolable compounds will either consider the structure to be in error because they believe it to be too unstable or, conversely, may assume that because the entry is in the database it can be made. Others who see the database as indexing all chemical information will be much more open to the idea that this only means that the structure was mentioned in passing in a paper or other resource.
[LRM] The consideration of how a user might interpret the information in a resource ties back into the lecture text for Module 4.
It is prudent to assume that every person or system that has generated and further processed compound data (or any data) has different criteria for using and interpreting the structure and associated information about a compound. Aggregator databases such as ChemSpider and PubChem that provide 'history' or provenance trails to the original sources, as we have seen, give users the option to make more informed decisions about how to interpret the records they find, based on their own criteria for their purpose.

Downloading multiple structures from SMILES or InChIs

There are several options available through CACTUS (<a href="http://cactus.nci.nih.gov/">http://cactus.nci.nih.gov/</a>), including a SMILES translator and the lookup service (structures returned are display only), among others. Open Babel: The Open Source Chemistry Toolbox (<a href="http://openbabel.org/wiki/Main_Page">http://openbabel.org/wiki/Main_Page</a>) provides many advanced tools for conversion. It is also possible to pull multiple structures from InChIs in PubChem using a two-step process:
1- First use the Identifier Exchange to get CIDs from InChIs: <a href="https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi">https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi</a>
2- Then use the Download Service to request structures from multiple CIDs: <a href="https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi">https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi</a>
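For a long list, the same two-step idea (InChI to CID, then CIDs to structures) can also be scripted. The sketch below is only one possible approach and rests on assumptions not stated above: it uses PubChem's PUG REST interface rather than the two web forms, it assumes the CID lookup returns JSON shaped like {"IdentifierList": {"CID": [...]}}, and the function names (inchi_to_cid_request, sdf_url, fetch_sdf) are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: PubChem's PUG REST service is used instead of the web forms.
PUG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def inchi_to_cid_request(inchi):
    """Step 1: build a POST request asking for the CIDs that match one InChI.

    The InChI goes in the form body because it contains '/' characters
    that would be ambiguous inside a URL path.
    """
    body = urlencode({"inchi": inchi}).encode("ascii")
    return Request(f"{PUG}/compound/inchi/cids/JSON", data=body)


def sdf_url(cids):
    """Step 2: build the URL that returns one SDF file for several CIDs."""
    return f"{PUG}/compound/cid/{','.join(str(c) for c in cids)}/SDF"


def fetch_sdf(inchis):
    """Run both steps over the network and return the SDF text.

    Hypothetical helper: no retry, rate limiting, or handling of InChIs
    that PubChem does not recognize (those raise an HTTP error here).
    """
    cids = []
    for inchi in inchis:
        with urlopen(inchi_to_cid_request(inchi)) as resp:
            # Assumed response layout: {"IdentifierList": {"CID": [...]}}
            cids.extend(json.load(resp)["IdentifierList"]["CID"])
    with urlopen(sdf_url(cids)) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_sdf with a list of InChI strings would return a single multi-record SDF that can be saved to a file. For very long lists it would probably be wise to send the CIDs in smaller batches and pause between requests.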

Removal of a Compound

Hi Brandon, One obvious reason would be that a structure for a published compound has been disproven. That’s not the case here, however. Although 1-aminoethane-1,2-diol appears as a published synthetic reagent, I’m guessing that this is based on its generation from other reagents in the reaction mix. Like gem-diols, carbinolamines are usually intermediates, not isolable compounds. To be sure, intermediates do appear in databases. For example, benzyne is listed in PubChem and ChemSpider. Again, I’m guessing, but it may be that the actual intermediacy of 1-aminoethane-1,2-diol in the mechanisms of some of these reactions is being called into question. I think “ChemSpiderman” is on the faculty list :-) He could answer this question with a lot less guessing than I’m doing! Best, Otis

Pulling structures

Is there a good way to pull molecular structures for a long list of SMILES or InChIs instead of looking them up one at a time?

IUPAC & the InChI project

A complete list of all IUPAC-sponsored InChI activities and related information can be found on the IUPAC web site:
http://www.iupac.org/body/802

More about InChI

I would like to add that there are other large databases that contain up to about 100 million InChIs/InChIKeys, some of which are not free to access (except NCI):

NIH/NCI – 110 million
NIH/PubChem – 91 million (68 million online)
EBI UniChem – 91 million
RSC/ChemSpider – 34 million
Elsevier/Reaxys – 30 million

I would also like to mention that there is an InChI for chemical reactions:

International chemical identifier for reactions (RInChI)

Guenter Grethe, Jonathan M Goodman and Chad HG Allen
J. Cheminf. 2013, 5:45, published online on 24 October 2013.

A number of freely available articles about InChI are in the open-access Journal of Cheminformatics:
http://www.jcheminf.com/search/results?terms=InChI

The most detailed technical article on InChI is:
InChI, the IUPAC International Chemical Identifier
Stephen R Heller, Alan McNaught, Igor Pletnev, Stephen Stein, Dmitrii Tchekhovskoi
Journal of Cheminformatics 2015, 7:23 (30 May 2015)

Please feel free to contact me with any questions or issues about InChI.

Steve Heller

Problems with looking up CAS numbers

As mentioned above, the CAS number is a proprietary resource that comes with a non-trivial fee. As such, most people look up the number in other databases or even on the CAS database. The problem is that many, many numbers in non-CAS databases are incorrect for a number of reasons. People don't pay attention to stereochemistry or to what salt is associated with a given number, etc., and assign the wrong number to their structure.
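One small sanity check that can be scripted is the check digit built into every CAS number: the last digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below is illustrative only, and the function name cas_check_digit_ok is made up here. It can catch mistyped or garbled numbers, but it cannot catch the problem described above, where a perfectly valid number is attached to the wrong stereoisomer or salt.

```python
import re

# A CAS number has the form NNNNNNN-NN-N: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas_rn):
    """Return True if the string is well-formed and its check digit matches.

    This only validates the number itself; it says nothing about whether
    the number belongs to the structure it was found next to.
    """
    match = CAS_PATTERN.match(cas_rn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    expected = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == expected
```

For example, water (7732-18-5) gives 8×1 + 1×2 + 2×3 + 3×4 + 7×5 + 7×6 = 105, and 105 mod 10 = 5, which matches the final digit.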

removal of a compound

What are some possible reasons that ChemSpider would remove an entry from its database?