Download PDF:

Download 2017OLCCModule7.pdf (187 KB)

Download 2017OLCCModule7Assignment.docx (28.75 KB)

Sunghwan Kim, National Center for Biotechnology Information

Learning Objectives

Know how to formulate a PUG-REST request URL.
Know how to access PubChem data from a spread sheet (in Google Sheet)
Know how to access PubChem data from a python script.

Programmatic Access to Web-Based Chemical Information – Examples

Comments: 1

1. Useful resources for students with no programming background

In this module, we will learn how to “programmatically” access PubChem and other public databases. Cheminformaticians often write computer programs to process a large amount of chemical data and to automate some tasks that are routinely performed. Therefore, computer programming is an essential skill for future cheminformaticians. While this module does not require prior knowledge of computer programming, it is highly recommended that students with no or little programming background build some basic computer programming skills. There are many online resources that help you learn computer programming and below are some of them.

w3schools.com (https://www.w3schools.com)
Tutorialspoint.com (https://www.tutorialspoint.com/index.htm)
Code.org (https://code.org)
Code Academy (https://www.codecademy.com)
Exercism (http://exercism.io)
Cloud9 (https://c9.io)
Ideone (https://ideone.com)
LearnPython.org (https://www.learnpython.org)
Python Challenge (http://www.pythonchallenge.com)

In addition, an introductory material about Web APIs (Application Programming Interfaces) is available at the OLCC web site (as a Special Topic Module).

Programmatic Access to Data (by Jordi Cuadros)
(http://olcc.ccce.divched.org/Spring2017OLCCSpecialTopic-Programmatic%20Access%20to%20Data)

The present module focuses on accessing PubChem data, but many other public chemical information resources also provide programmatic access to their data. Some examples complied by the OLCC Cheminformatics Faculty are available at this URL:

Programmatic Access to Web-Based Chemical Information – Examples
(http://olcc.ccce.divched.org/Spring2017OLCCModule7TLO1)

2. Overview of Programmatic Access to PubChem

Currently PubChem^1-3 contains more than 235 million depositor-provided substance descriptions, 94 million unique chemical structures and 232 million biological test results from one million assays, covering more than 10 thousand unique protein target sequences. Many researchers in the biomedical science community have a great interest in programmatic access to this vast amount of data because it presents new opportunities for data-driven research in a “big data” era. PubChem provides several ways for programmatic access to its data, including:

Entrez Utilities (also called E-Utilities or E-Utils)
Power User Gateway (PUG)
PUG-SOAP
PUG-REST

E-Utilities, used for programmatic access to information contained in the Entrez system, are suited for accessing text or numeric-fielded data, they cannot deal with more complex types of data specific to PubChem, such as chemical structures and tabular bioactivity data. Thus, PubChem provides additional programmatic access routes specialized for PubChem data and analysis services: PUG, PUG-SOAP, and PUG-REST. While suitable for low-level programmatic access to PubChem, PUG exchanges data through a complex eXtended Markup Langauge (XML) schema that requires some expertise to use. For the sake of user-friendliness and for integration with a variety of third party tools, PubChem provides two easier-to-use web service access methods: PUG-SOAP, which uses the simple object access protocol (SOAP) and PUG-REST, which is a Representational State Transfer (REST)-style interface. An overview of these programmatic access routes to PubChem is given in the following paper:⁴

PUG-SOAP and PUG-REST: Web Services for Programmatic Access to Chemical Information in PubChem
Kim et al., Nucleic Acids Res. 2015, 43(W1), W605-W611.
(http://dx.doi.org/10.1093/nar/gkv396).

In this module, we will learn how to access PubChem using PUG-REST, because it is the simplest to use and learn. More in-depth information on programmatic access to PubChem is described in these documents:

PUG-REST
- PUG-REST Help (http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html)
- PUG-REST Tutorial (http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html)
- Programmatic Retrieval of Small Molecule Information from PubChem Using PUG-REST (in press).
  [Available to logged-in students at bottom of this module.]
E-Utilities
- Entrez Programming Utilities Help (http://www.ncbi.nlm.nih.gov/books/NBK25501/)
- E-Utilities Quick Start (http://www.ncbi.nlm.nih.gov/books/NBK25500/)
PUG
- PUG Help (https://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html)
PUG-SOAP
- PUG-SOAP Help (http://pubchem.ncbi.nlm.nih.gov/pug_soap/pug_soap_help.html
- PUG-SOAP Client Help (http://pubchem.ncbi.nlm.nih.gov/pug_soap/client_help.html)
- PUG-SOAP Web Service Reference (http://pubchem.ncbi.nlm.nih.gov/pug_soap/PUG_SOAP.html)

3. PUG-REST

3.1. Concepts and Syntax of PUG-REST requests

PUG-REST is the simplest to use and learn among the existing programmatic access methods to PubChem. Importantly, because information necessary for a PUG-REST request can be encoded into a single Uniform Resource Locator (URL) that can be written by hand without programming expertise. Conceptually, a web service request from the user to PubChem requires three pieces of information:

input: a list of PubChem identifiers of interest (e.g., CID, AID, SID).
operation: what to do with the input identifiers.
output: the format of the output from the operation.

In PUG-REST, these three pieces of information are encoded into an URL in the following format:

Some tasks require additional pieces of information that do not fit into the three-part PUG-REST URL. They should be provided as a list of ‘&’-separated option name and option value pairs, following the question mark (“?”) appended at the end of the request URL. Some examples are presented in next section, but there are much more things that users can do through PUG-REST. To get more detailed information on PUG-REST, read the following four articles:

PUG-SOAP and PUG-REST: Web Services for Programmatic Access to Chemical Information in PubChem
Kim et al., Nucleic Acids Res. 2015, 43(W1), W605-W611.
(http://dx.doi.org/10.1093/nar/gkv396).
Programmatic Retrieval of Small Molecule Information from PubChem Using PUG-REST (in press).
[Available to logged-in students at bottom of this module.]
PUG-REST Help (http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html)
PUG-REST Tutorial (http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST_Tutorial.html)

3.2. PUG-REST examples

Getting the full record of a compound record.

One of the most common tasks requested through PUG-REST is to retrieve all computed properties for a given chemical or chemicals. The following example URLs retrieve the full record of acetone in an XML format.
In the above examples, the input compound is specified by CID (CID 180), name (acetone), and InChiKey (CSCPPACGZOOCGX-UHFFFAOYSA-N). The input identifiers can also be specified by SMILES or InChI strings, although special care needs to be taken because these identifiers contain special characters (such as “/”) that cause conflicts with the URL syntax.⁴ In addition, the hits returned from a structure search (e.g., identity/similarity search or sub/superstructure search) can be used as the input identifiers for a PUG-REST request. [It is highly recommended to use the synchronous “fast” inputs for structure searches (e.g., fastidentity, fastsimilarity_2d, fastsimilarity_3d, fastsubstructure, fastsuperstructure, and fastformula). Read the “Asynchronous operations in PUG-REST” section of this paper.]

The keyword “record”, followed by the input identifier, invokes the full record retrieval of the input compound. By default, the “record_type” option is set to “2d”, meaning that the information derived from the 2-D structure of the input compound will be returned. By setting this option to “3d”, one can get the information derived from the 3-D structure. Therefore, examples (a) through (d) returns 2-D information, and example (e) returns 3-D information.
Getting the 2-D and 3-D structure image of a compound

Below are examples of the molecular structure image retrieval through PUG-REST.
When the output file format for full record retrieval is set to “PNG”, the image of the input compound is retrieved. The 3-D image of the input compound can be retrieved by setting the “record_type” option to “3d”. When a list of compounds are specified as the input, the image of only the first compound on the list will be returned.
Getting molecular properties of a set of compounds

One may download molecular properties for a set of compounds, as in the following example:
1. https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/887,702,1031,263/property/MolecularFormula,MolecularWeight,CanonicalSMILES,HeavyAtomCount,XLOGP/CSV
In this example, the molecular formula, molecular weight, canonical SMILES, Heavy Atom Count, and XLOGP value for 4 compounds are retrieved in a CSV format, which can be read in a spread sheet program like Excel and Google Sheet. A list of the molecular properties available through PUG-REST can be found in the PUG-REST specification document:

https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html#_Toc458584223
Getting a list of CIDs whose synonym is or contains “atorvastatin”

Through PUG-REST, one can perform a search by synonym.
By default, the “name_type” option is set to “complete”, meaning that only those whose synonym completely matches the input chemical name will be returned. One can perform partial synonym matching, by setting this option to “word”. These search options are conceptually equivalent to an Entrez search with Entrez indices “[completesynonym]” and “[synonym]”.
Getting a list of CIDs for compounds identical to a query compound

Below are examples of PUG-REST request URLs for identity search.
Note that various “contexts” of identity can be selected using the “identity_type” option. By default, this option is set to “same_stereo_isotope”, which returns compounds identical to the query compound in both stereochemistry and isotopism. When it is set to “same_connectivity”, the identity search will returns molecules with the same connectivity (ignoring stereochemistry and isotopism).
Getting a list of CIDs for compounds with a given substructure

The following examples perform substructure searches through PUG-REST.
By default, the substructure search will keep explicit hydrogen atoms in the query substructure. By setting the StripHydrogen parameter to “true”, one can strip off hydrogen atoms from the query substructure before performing a substructure search.
Getting a list of CIDs for compounds with a given molecular formula

The following examples show how to perform a molecular formula search through PUG-REST.
1. https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/C6H12O6/cids/TXT
2. https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/C6H12O6/cids/TXT?AllowOtherElements=true
By default, the search results from a molecular formula search will exactly match the entered stoichiometry. One may allow other elements in the returned results by using the “AllowOtherElements=true” option.
Getting a list of AIDs for assays that target a given protein.

One can retrieve assays that are performed against a particular target, which can be specified by gene symbol, NCBI’s Gene ID, or NCBI’s global identifier (gi) (for protein sequences).
Getting a list of compounds tested in an assay.

The following examples show how to get a list of compounds tested in a given assay.
Using the “cids_type” option, one can download all compounds tested in an assay or only those compounds tested to be active (or inactive).
Getting bioactivity data determined in an assay.

It is possible to download the bioactivity data for a given assay through PUG-REST.
1. https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1053202/record/CSV
2. https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1053202/record/CSV?cid=823611,1270560,2104765
It is also possible to download the bioactivity data for a particular compound or compounds tested in the input assay by providing the comma-separated list of CIDs of interest after the question mark.
Getting a list of assays in which a compound was tested.

Through PUG-REST, one can get a list of assays in which a given compound was tested. Below are some examples.
By adjusting the “aids_type” parameter to “active” (or “inactive), one can retrieve assays in which the compound is tested to be active (or inactive). By default, any assays in which the input compound is tested are retrieved.

3.3. Accessing PubChem data from a spread sheet program

Through PUG-REST, one can access PubChem data from a spread sheet software, such as Google Sheet or MicroSoft Excel. Below is an example Google Sheet file that shows how to auto-populate data from PubChem into the spread sheet.

https://docs.google.com/spreadsheets/d/12-UkxgCxNVp2Ceim47TRffhqquLi2TFPKj3PJpii7bY/edit?usp=sharing

In this example, the function IMAGE() is used to retrieve the molecular images in the PNG format, and IMPORTXML() is used to get the record name and some molecular properties. While the IMPORTXML() function is used to get data in an XML format, the IMPORTDATA() function should be used to get PubChem data in CSV or TXT format. Note that the PUG-REST request URL used in the IMPORTXML() contains a comma-separated list of the input CIDs to retrieve data for all CIDs at a single request. While it is possible to make ten PUG-REST requests (one for each CID) to get the same data, it is highly recommended that users should minimize the number of requests. On the other hand, the image retrieval through PUG-REST supports only one input CID at a time, and therefore ten PUG-REST requests are made to get the images for all CIDs.

It is also possible to use MicroSoft Excel to do the same task as done in the above Google Sheet example. The WEBSERVICE() function in MicroSoft Excel 2013/2016 is equivalent to IMPORTXML() and IMPORTDATA() in Google Sheet. It returns the data retrieved from a web service request (in both XML and CSV/TXT). However, MicroSoft Excel does not have a built-in function equivalent to IMAGE() in Google Sheet, one needs to write a script for the image retrieval in VBA (Visual Basic for Applications), which is beyond the scope of this module.

3.4. Accessing PubChem data from a script

This section introduces how to make a PUG-REST request from a script, with example Python scripts. These examples are simple and short enough for students with no programming background to understand. You can modify and run these scripts on your web browser. While the examples are written in Python, it is possible to do the same tasks using other programming languages.

Example 1: Search by chemical name (https://trinket.io/library/trinkets/f009cf46ee)

This example searches PubChem for compounds whose name is “atorvastatin” and retrieve their full record. All lines beginning with the “#” characters are comments, which are ignored during the execution of a script. These comments are usually used to provide human-readable explanations about the script.

Without the comments and blank lines, only the remaining three lines are actually executed. In line 8, the “urllib.request” module is imported, which contains the definition of the function “urllib.request.urlopen()” in line 11. This function makes a web service request to the URL provided within the parentheses. The returned result is stored in the “request” object. In line 14, the returned data are read from the object and printed out.

Example 2: Molecular property retrieval (https://trinket.io/library/trinkets/a42e5b43e8)

This example illustrates how to retrieve molecular properties of a list of CIDs. In this example, the three pieces of information required for a PUG-REST request (that is, the input, operation, and output) are stored in respective variables (Lines 12-14). These variables are used to construct a PUG-REST request URL, which is stored in a variable called “url” (Line 17). In Line 20, this “url” variable is used as an argument in the function “urllib.request.urlopen()”, which makes the PUG-REST request.

Note that Example 2 uses additional variables to store different parts of the PUG-REST request URL, making it longer than Example 1. It is possible to reduce Example 2 into a three-line script without using these additional variables, by directly specifying the PUG-REST URL in the parentheses after “urllib.request.urlopen” as shown in Example 1.

Example 3: 2-D similarity search using CID queries (https://trinket.io/library/trinkets/490a1841a3)

In this example, 2-D similarity search is repeatedly performed using a list of CIDs as queries. The query compounds are defined in Line 17. Because 2-D similarity search can take only one query compound at a time, it should be repeated using a “for” loop as shown in Lines 20 through 28. In this “for” loop, each CID in “queries” (defined in Line 17) is assigned to the variable “mycid”, which is used to construct a PUG-REST request URL in Lines 24-25.

Note that the input CID is converted into a string [using the str() function at line 24] before it is concatenated with the other parts of the URL. This conversion is necessary because the input identifiers are provided as numbers at Line 17. If they are provided as strings (with each of them enclosed in quotes), they could be directly used in the URL.

Example 4: 2-D similarity search using isomeric SMILES queries (https://trinket.io/library/trinkets/3437cdefb0)

This example also performs 2-D similarity search using a list of SMILES strings as queries. An important difference between Examples 3 and 4 is the way in which the input identifiers are encoded in the PUG-REST request URL. Some special characters used as input identifiers are also reserved for the URL syntax, causing issues when directly encoded in an URL path. An example is the “/” (forward slash) character used in isomeric SMILES and InChI strings. To avoid conflicts with URL syntax, these line notations need to be included in the options part of the URL (after the “?” mark) (as in Lines 27-28).

Programmatic Access to Web-Based Chemical Information – Examples

The above links brings you to a page where you can access and discuss examples created by faculty.

References

(1) Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L. Y.; He, J. E.; He, S. Q.; Shoemaker, B. A.; Wang, J. Y.; Yu, B.; Zhang, J.; Bryant, S. H. Nucleic Acids Res. 2016, 44, D1202.

(2) Wang, Y.; Bryant, S. H.; Cheng, T.; Wang, J.; Gindulyte, A.; Shoemaker, B. A.; Thiessen, P. A.; He, S.; Zhang, J. Nucleic Acids Res. 2017, 45, D955.

(3) Kim, S. Expert Opinion on Drug Discovery 2016, 11, 843.

(4) Kim, S.; Thiessen, P. A.; Bolton, E. E.; Bryant, S. H. Nucleic Acids Res. 2015, 43, W605.

Questions

Formulate a PUG-REST request URL to perform the following tasks.
1. Get the full record (in XML) for CID 60823 that are derived from its 3-D structure.
2. Get the small-size 2-D image of CID 2244.
3. Get the XLogP and TPSA values of CID 1983, 2244, and 3672 in the CSV format.
4. Get the list of CIDs whose name contains the string “crestor” in the TXT format.
5. Get the list of CIDs for compounds that are similar to CID 3348 in terms of 2-D similarity (at the threshold = 95)
6. Get the full record (in XML) of CIDs that have the substructure “C3=NC1=C(C=NC2=C1C=NC=C2)[N]3”
7. Get a list of AIDs (in TXT) that target the protein encoded by the KCNH2 gene.
8. Get a list of CIDs (in TXT) for compounds tested to be active in AID 376.
9. Get the bioactivity data (in CSV format) for AID 376.
10. Get the sources of the boiling point annotations in PubChem (Hint: see Module 5).
This question is designed to test whether you can access PubChem data from a spread sheet program.
1. Make a Google Sheet like this template (http://bit.ly/2o0pzoT) and auto-populate it with the CID, molecular formula, heavy atom count, and XLogP values for alkanes and alcohols (with 1 through 10 carbon atoms). Provide the link to the completed google sheet so that your instructor can access it.
2. Compare the XLogP values of alkanes with those of alcohols. Which one is greater, and explain why.
Modify the example python script 4 (https://trinket.io/python/3437cdefb0) to retrieve (in a CSV format) the molecular formula, molecular weight, hydrogen bond donor count, hydrogen bond acceptor count, and XLogP values of compounds returned from a 2-D similarity search using the following three smiles strings:
- C1CCNCC1
- C1CCC(CC1)(CC(=O)O)CN
- C1=CC=C(C=C1)NC2=CC=CC=C2C(=O)O
[Please create an trinket.io account, save your work in the trinket.io platform, and provide your instructor with the link to it.]

OLCC course:

Spring 2017 Cheminformatics OLCC

Join the conversation.

Comments 7

PUG-REST

Are we able to output all molecules in the PubChem data base with a molecular weight range, for example of 100 to 200, through the PUG-REST search? We know we can do this through the advanced search builder on PubChem, but is this able to be done programatically?

No, but E-Utilities can do a range search

PUG-REST does not have a way to perform a property range search (e.g., retrieving all molecules with the molecular weight between 100 and 200). This kind of search is supported by another programmatic access route, called Entrez-Utilities (E-Utils). (Actually, almost everything you do using the Entrez system can be automated using E-Utilities, but E-Utils are limited to text or numeric fielded data (e.g., chemical name search and search by numeric value like molecular weight) and cannot be used for structure searches (super/substructure search, similarity search, ...) that take a chemical structure as a query (in the form of SMILES/InChI or provided through the chemical structure sketcher).

So, depending on the type of your tasks, you need to use PUG-REST, E-Utils, or some other programmatics access routs, although this module does not cover all access routes. If you have a particular task but you don't know what to use, feel free to contact me.

Sort of going the other way

It seems like most of the examples we have done are based on the options available on the Entrez system. In advanced search you can do a Boolean "and", and if you had the options, upload a list of chemicals, and find the ones that had a molar mass over a range. But can you do that for a different property, like the melting point? (I know we can find who submitted melting pts to pubchem). So the question is, can a spreadsheet pull back things like melting points? boiling points, from PubChem? If so, we may be able to do a calculation in the spreadsheet that filters out things that are outside of a given range. Cheers, Bob

Only computed properties are indexed.

Currently, you can "search" PubChem by computed properties only, and it does not support experimental properties. To have such functionalities in PubChem, there are a couple of issues that should be addressed beforehand. One of the issues is that the experimental properties collected from various sources often do not contain necessary meta data (For example, the atmospheric pressure at which the boiling/melting temperature was measured). This makes it difficult to standardize the experimental values. The units employed should be standardized (e.g., Kelvin, deg-C, or deg-F). In addition, many experimental data are not really structure-specific (e.g., there are many different substances that can be represented as SiO2, but with very different properties). In theory, these issues can be addressed in some way, but we don't really have enough resources to work on this.

Experimental Data

I realise this discussion has come up under a module: "Programmatic Access to Public Chemical Databases" and since Reaxys is not a free database then what is possible in Reaxys is probably irrelevant. Nevertheless I offer a couple of general comments: 1. Creating databases for experimental properties is very time consuming and hence, since someone has to do the work, then it can be quite expensive; 2. Dr Kim in his comments raises questions such as: a. some physical properties are dependent on conditions of measurement (boiling point/pressure); b. physical properties are reported in different units and the values need to be normalised, and c. different states of matters (and certainly different purities of substances) can affect the data (so often not only the data but also notes on the data need to be included). In short, to create experimental property databases you not only need massive resources, but you also need a number of rules/guidelines relating to what to excerpt and how to organise (then search) the data. The other thing to add here is that very often the substance itself is of much lesser interest than what it does (i.e., its properties). Yes, cholesterol has 27 Cs, 46Hs and an O and we can draw its structure or even visualise it in 3D (kind of - since who really knows how the side chain is coiled in different environments), but it is what it does (i.e., its properties) that really gets us scientists excited. Recently I questioned one of the people at Reaxys responsible for the excerption and organisation of physical data and he informed me that in the vast majority of instances the original data in documents needed correction, normalisation or qualification in some way or other before the data was entered in the database. While he had never done a complete analysis of the situation, he suspected that data in well over 95% of the original documents needed to be re-evaluated in one way or other. Since Reaxys has well over 400 fields of data (of which approximately half involve numeric data and half have text data) and since the extent of experimental 'property data' in Reaxys is (I estimate) around 100 times larger than any other product, you can get a sense of the magnitude of the undertaking (and of the added value) that Reaxys makes (and provides). So, yes, in Reaxys you can pull spreadsheets of experimental data out and compute the data with in-house software. Further the data pulled out can go well beyond just substances (or molecular weights) and properties such as melting/boiling points. As a couple of examples you can easily find critical superconducting temperatures of lanthanide-containing substances, or IC50 data values relating to studies of substances for pharma targets or diseases. Cheers, Damon

Google Sheets

My students and I have been working on the Google Sheets assignment focused on section 3.3 of this Module. I have tried several ways to pull the CID information into the C column in the spreadsheet shown in the image below. First we tried the importxml with cell B3 exporting the CIDs as XML following the example spreadsheet using -- =importxml(B3, "//*[local-name()='cid']") then I switched B3 to export as TXT and tried importdata. Neither of these is giving me cid values in column C. I'm guessing there is something simple I am missing.

Comment File:

Google Sheets Morsch.png

You can't provide multiple chemical names as input.

It is possible to provide multiple CIDs as an input (in the form of a comma-seperated list) like this:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/887,702,1031,263/property/MolecularFormula,MolecularWeight,CanonicalSMILES,HeavyAtomCount,XLOGP/CSV

However, you can't do this with chemical name inputs, because many chemical names contain commas in them (e.g., 1,2-dichloroethan), and the server cannot tell whether a comma is used as a separator, or as a part of the name. When chemical names are used as an input, you need to process one name at a time.

7. Programmatic Access to Public Chemical Databases

Learning Objectives

Table of Contents

Programmatic Access to Web-Based Chemical Information – Examples

1. Useful resources for students with no programming background

2. Overview of Programmatic Access to PubChem

3. PUG-REST

3.1. Concepts and Syntax of PUG-REST requests

3.2. PUG-REST examples

3.3. Accessing PubChem data from a spread sheet program

3.4. Accessing PubChem data from a script

Programmatic Access to Web-Based Chemical Information – Examples

References

Questions

Comments 7

PUG-REST

No, but E-Utilities can do a range search

Sort of going the other way

Only computed properties are indexed.

Experimental Data

Google Sheets

You can't provide multiple chemical names as input.

Annotations