2.6 Accessing Data on Websites and Application Programming Interfaces (APIs)

To bring this module to a close it is important to discuss the ramifications of all that we have covered in the previous sections.  By now you will hopefully have come to the realization that i) there is a lot of information/data in the world, ii) organizing and identifying it is an important and huge undertaking, and iii) accessing chemistry-related data in this environment cannot be done with a Google search alone.  A lot of what chemical informatics is about is providing ways to deal with lots of data, describing best practices for identifying and storing information, and developing tools that allow scientists to find and retrieve quality information that can be used for teaching and research.

One of the fundamental parts of this is understanding where good chemistry data is, working out how to search it efficiently using chemical terms, and knowing how to download it so that you can use it.  In the previous sections you have seen examples of how data can be processed and stored in databases, but how do you get the data? It turns out that bringing SQL databases and scripting languages together to build webpages is only one facet of their usefulness.  Increasingly, websites are being built to serve up data; that is, they are designed to let the user search databases via well-defined standards and to return the results for both human and computer use.

Over the last few years the design of websites that follow the Representational State Transfer (REST) paradigm has become popular, as it allows websites to act as web services.  What that means is that if you know how to construct the URLs for pages on these websites, you can anticipate the information that you are going to see on the page that is returned.  This concept is much easier to see in practice.

One website that has a large amount of chemical metadata (information about chemicals – not chemical data like melting points etc.) is the NIH Chemical Identifier Resolver (CIR) website (see http://cactus.nci.nih.gov/chemical/structure). The website allows you to get information by writing it in the general format below.

http://cactus.nci.nih.gov/chemical/structure/"structure identifier"/"representation"

In this context “structure identifier” means any of the following: chemical names, SMILES, InChI strings and keys, and NIH identifiers like FICTS and FICuS.  What you get back from a search is determined by the “representation” part of the URL, which can be any of those above as well as ‘sdf’ (otherwise known as a MOL file), ‘formula’ and ‘image’.  So, the URL below

http://cactus.nci.nih.gov/chemical/structure/arsinic acid/names

prints out the names of this compound that are in the CIR system, formatted for humans to read.  If you access the site from a program (e.g. via a scripting language like PHP), you can also request that the data be sent to you as XML by using

http://cactus.nci.nih.gov/chemical/structure/arsinic acid/names/xml

which returns the following document.

<request string="arsinic acid" representation="names">
    <data id="1" resolver="name_by_opsin" notation="arsinic acid">
        <item id="1" classification="pubchem_iupac_name">arsinic acid</item>
        <item id="2" classification="pubchem_substance_synonym">CHEBI:29840</item>
        <item id="3" classification="pubchem_substance_synonym">HAsH2O2</item>
        <item id="4" classification="pubchem_substance_synonym">[AsH2O(OH)]</item>
        <item id="5" classification="pubchem_substance_synonym">arsinic acid</item>
        <item id="6" classification="pubchem_substance_synonym">dihydridohydroxidooxidoarsenic</item>
    </data>
</request>

This is useful because it contains additional information (metadata) about the type of name that an entry is and it makes it easy for a script to take the data and do something with it – like put it in another database.
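As a rough illustration of how a script might use this, the Python sketch below assembles a CIR URL from an identifier and a representation, then pulls the names and their classifications out of an XML document shaped like the one above. The function names (build_cir_url, parse_names) and the shortened SAMPLE_XML string are illustrative assumptions, not part of the CIR service, and the parsing works on the embedded sample rather than contacting the site.

```python
from urllib.parse import quote
import xml.etree.ElementTree as ET

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def build_cir_url(identifier, representation, as_xml=False):
    """Assemble a CIR URL of the form base/identifier/representation[/xml]."""
    # quote() percent-encodes spaces and characters such as '#';
    # by default it leaves '/' alone (how CIR treats encoded slashes
    # is not assumed here).
    url = f"{CIR_BASE}/{quote(identifier)}/{representation}"
    return url + "/xml" if as_xml else url


# Shortened copy of the document shown above (hypothetical sample,
# only three of the items are kept).
SAMPLE_XML = """
<request string="arsinic acid" representation="names">
    <data id="1" resolver="name_by_opsin" notation="arsinic acid">
        <item id="1" classification="pubchem_iupac_name">arsinic acid</item>
        <item id="2" classification="pubchem_substance_synonym">CHEBI:29840</item>
        <item id="3" classification="pubchem_substance_synonym">HAsH2O2</item>
    </data>
</request>
""".strip()


def parse_names(xml_text):
    """Return (classification, name) pairs for every <item> element."""
    root = ET.fromstring(xml_text)
    return [(item.get("classification"), item.text) for item in root.iter("item")]


print(build_cir_url("arsinic acid", "names", as_xml=True))
for classification, name in parse_names(SAMPLE_XML):
    print(classification, name)
```

Because each name arrives with its classification attached, a script can keep only the entries it needs (for example, just the IUPAC name) before storing them elsewhere.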

There are many sites that chemists can use to get chemical data, including ChemSpider (http://www.chemspider.com/), PubChem (https://pubchem.ncbi.nlm.nih.gov/), and the NIST WebBook (http://webbook.nist.gov/chemistry/).  Each holds a different set of data, but all have a standard way to interact with their website.  To encourage users to take full advantage of their search features, many sites publish an Application Programming Interface (API) that delineates what you can search for and how to do it.  Facebook has an API, Twitter has an API, Instagram… the list goes on.  APIs are the way in which websites can “talk” to other websites to integrate data, or present ‘widgets’ that allow you to see what's happening on one site when you are on another.  A good example of a well-documented and sophisticated API for chemistry data is the one at PubChem (see https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html), which has a nice tutorial at https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html.  With over 68 million compounds to search you are sure to find something!
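To give a flavor of what such an API looks like from a script, here is a minimal Python sketch that builds a PubChem PUG REST URL asking for properties of a compound looked up by name. It follows the general input/operation/output pattern described in the PUG REST documentation linked above; the helper name pubchem_property_url is an assumption made for illustration, and the sketch only constructs the URL, so check the tutorial for the current list of supported properties and output formats before relying on it.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(name, properties, output="JSON"):
    """Build a PUG REST URL: compound by name -> listed properties -> output format."""
    # Several properties can be requested at once as a comma-separated list.
    prop_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{quote(name)}/property/{prop_list}/{output}"


print(pubchem_property_url("aspirin", ["MolecularFormula", "MolecularWeight"]))
```

The URL has the same "identifier, then what you want back" shape as the CIR example, which is what makes REST-style services predictable to use.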

Additional Material

Molar Mass from Cactus in Excel 2013

Excel and Cactus

The video above describes how to use the NIH Cactus Chemical Identifier Resolver to make a column of molar masses from chemical names. Note that the molar mass fields need to be number fields with 4 digit spaces, and the fields with the URLs need to be text fields. You will probably want to watch the video in full-screen mode so you can read everything; further information is available by clicking on the title.
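The same idea can be sketched outside a spreadsheet. The Python below is a minimal, hedged example that turns a list of chemical names into rows of (name, URL, molar mass). It assumes the resolver's 'mw' representation returns the molecular weight as plain text; the function names and the None placeholder for failed lookups are illustrative choices, not anything prescribed by the video.

```python
from urllib.parse import quote
from urllib.request import urlopen

CIR_BASE = "http://cactus.nci.nih.gov/chemical/structure"


def fetch_text(url):
    """Download a URL and return its body as stripped text (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8").strip()


def molar_mass_rows(names, fetch=fetch_text):
    """Return one (name, url, molar_mass) tuple per chemical name."""
    rows = []
    for name in names:
        # Assumption: the 'mw' representation yields the molecular weight.
        url = f"{CIR_BASE}/{quote(name)}/mw"
        try:
            mass = float(fetch(url))
        except (OSError, ValueError):
            # Network/HTTP errors or a non-numeric reply: leave the cell empty.
            mass = None
        rows.append((name, url, mass))
    return rows
```

Calling molar_mass_rows(["water", "ethanol"]) would contact the resolver once per name. The fetch parameter exists so the lookup step can be swapped out, for instance with a stand-in function when no network is available.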




Additional Resources

Molar Mass from Cactus in Excel 2013
Molar Mass from NIH Resolver and Google Sheets
Another Way to Get Molar Mass from NIH Chemical Resolver and Google Sheets

Comments 7

Alex Williams (not verified) | Mon, 09/07/2015 - 23:00
I never knew that these sites published APIs; this kind of data would be great for creating one's own pathway finder for synthesis.

OLCC s12 | Thu, 09/10/2015 - 13:42
After reading through this section on APIs and then doing the assignment on putting data into Excel, how would you recommend importing any missing data using APIs for a very large list? For my example I pulled data from the OPCW list of chemical weapons, which included a number, chemical type, examples and CAS numbers. The molecular weights were not included, so, going along with my question, what would be the quickest way to import these molecular weights using APIs in Excel? At least without having to copy and paste individual numbers that were manually looked up in a web browser based on the API URL structure you explained using cactus.nci.nih.gov. It seems that for a large list it would likely take many weeks just to find the molecular weights.

Stuart Chalk | Thu, 09/10/2015 - 17:04
I was not expecting students to bring a large number of chemicals into Excel for this exercise, so I was not expecting this to be a burden. I would talk to your advisor about how many of the chemicals you have imported and how many molecular weights they would like to see. This is not about making it an onerous task, more about the process and learning by doing. If you wanted to be creative with this, you could write an Excel function to generate a URL to access the molecular weight data on the Chemical Identifier Resolver site (cactus.nci.nih.gov) based on the name of the compound. I will leave you to think about how to do this... Now, your question about importing data is actually a great segue into another topic that is coming up later in the course - using Excel as a database and importing data from remote sources. You can do this in Excel for Windows but not Mac, and you have to find the right source. I am not going to say any more so as not to steal the thunder of one of my fellow lecturers...

Alex Williams (not verified) | Tue, 09/15/2015 - 00:30
Is there any alternative to Microsoft Access for the Mac? I am not sure how I would do all of this database work without it since it's such an industry standard.

Stuart Chalk | Tue, 09/15/2015 - 07:43
Sadly, MS does not make a Mac equivalent of MS Access. Apple has Numbers, and Apple's subsidiary FileMaker has a good database. If you want a simpler relational database you might try the open source 'DB Browser for SQLite' (http://sqlitebrowser.org/). For me personally there is no substitute for MySQL (http://dev.mysql.com/downloads/mysql/), as the Community Server edition is free (no support from Oracle, but there is great online documentation and a wealth of knowledge on the web about how to use it). The downside to this is that it requires a webserver; however, you can install all you need if you use MAMP (https://www.mamp.info/), which installs open source versions of Apache, MySQL, PHP and phpMyAdmin (a web-based interface for MySQL written in PHP) on either Mac or PC. It is awesome!

Joshua Henrich (not verified) | Tue, 09/15/2015 - 00:59
Dr. Chalk, what exactly is metadata? Hopefully, it's not too basic of a question. I'm seeing most of these terms for the first time.

Stuart Chalk | Tue, 09/15/2015 - 08:03
Metadata is the term used to describe information that contextualizes other information, or data about data. It sounds very abstract but actually makes sense if you think about an example. Consider the data in the citation managers in the last module. If you consider the content of a scientific paper 'data', then the metadata for that are its descriptors: title, author(s), journal, volume, issue, pages, year, author address, etc... The metadata is the information that characterizes a piece of data and as a consequence allows you to search for it in a database. Metadata as a term comes out of the library community and it's everywhere. Think about searches you do online at amazon.com, for instance. When you search for, say, a thumb drive, it will show you items you can buy and present you with options on the left-hand side that you can refine the search by: size, brand, color, special features - all of these are metadata about the different thumb drives that allow you to narrow down the search.