Jordi Cuadros, (jordi.cuadros@iqs.url.edu)

Learning Objectives

Get a basic understanding of Web APIs are accessed.

Introduction to web APIs

One common way to offer data to be used programmatically is through a web API. This learning object will be devoted to discuss how you access and use a web API.

What’s a web API?

An API (Application Programming Interface) is the set of elements that a programming library or service makes available to other programmers to be used remotely to access their services and data.

This API is called a web API when these functionalities are delivered via the HTTP protocol through the Internet.

Many web APIs exist, listings of public APIs can be found in https://www.publicapis.com/, https://any-api.com/ or https://www.programmableweb.com/.

How is a web API used?

In order to be able to get information from an online database, we will need to:

1.Send a query to the database through the Internet (HTTP protocol)

2.Understand the format in which the response will be received

3.Process this response to extract the desired information

Sending the query... - The HTTP protocol and the REST architecture

HTTP stands for HyperText Transfer Protocol and it comprises the set of rules that control the transfer of a resource between a web server and a web client (usually a browser).

The basic elements of an HTTP transaction are:

A request message
A response message (which is usually the resource being transferred)

The HTTP request message

The request message consists in some text data sent to an URL (Uniform Resource Locator). An URL is a way of specifying an address in the Internet.

The general structure for an URL consists of

scheme://[user:password@]domain:port/path?query_string#fragment_id

although in HTTP this commonly simplified to

http://domain/path or https://domain/path

The request message, which is usually handled by the HTTP client, has the following structure

A first line which includes method, path and HTTP protocol version, usually

METHOD /path HTTP/x.x

Several header lines
A blank line
And an optional message body

The HTTP request methods: GET and POST

Although other methods exist, the most common request methods are GET and POST.

In the GET method, any additional information required to specify the request is included in the URL (which means in the path included in the first line of the message). The parameters are included after the path and a question mark as pairs name=value. Parameters are separated by the ampersand symbol (&)

/path?name1=value1&name2=value2

In some cases, as we will see in PubChem API, the parameters are included in the path. This technique is called URL rewriting and shortens the URLs and makes it more usable.

/path/name1/value1/name2/value2

or even

/path/value1/value2

In POST methods, any additional information is not included in the URL but in the body of the message and is usually submitted via a web form. We won’t go further into this method since most of the chemistry APIs can be queried via GET calls.

To end this brief introduction to HTTP, note that some characters are either disallowed or have special meaning in URLs. When these characters appear they will need to be encoded; for example a space is encoded as %20, and a sharp (#) is %23. A reference can be found at http://www.w3schools.com/tags/ref_urlencode.asp.

The HTTP response message

As we have seen for the request, the response contains:

A first line
Several header lines
A blank line
And a message body which is the transferred resource

Two aspects of the response to be highlighted are the status code (included in the first line) and the content-type information (which is one of the headers).

HTTP status codes are indicated as a three-digit number

200 means transaction completed successfully
3xx means some type of redirection
4xx means a client-side error; e.g. 404 means resource not found
5xx means a server-side error; e.g. 500 means unexpected server error

The content type parameter indicates what data is included in the message body. Each type of resource is indicated with a standardized string (Internet media type; aka MIME type). For example, some common response formats are

HTML file: text/html
XML file: application/xml or text/xml
CSV file: text/csv
JSON file: application/json
PNG image: image/png

Querying the databases

We have so far seen how the information is transmitted over the WWW. But how are databases accessed over the WWW? There are two main architectures used to access databases over the Net: SOAP and REST. Given its prevalence, we will center our discussion on the REST architecture.

Some major characteristics of the REST (REspresentational State Transfer) architecture are:

It is usually implemented over HTTP.
The queries are implemented as GET requests.
The communication is stateless. Any query must contain all required information. Nothing is saved in the server.
The query result is obtained as a response in a predefined format: XML, JSON, CSV, HTML, plain text...

Understanding the responses... - Common structured data filetypes

Common formats in which a web API may provide its response are HTML, XML, CSV, JSON and plain text. A short explanation of the first four types follows next. Feel free to skip it if you already know their characteristics.

HTML (HyperText Markup Language)

HTML is the standard language used to create web pages. When a web API outputs its result in HTML, this resource is usually intended to be viewed a browser. When no better option is available, HTML can be processed (parsed) to extract a specific information.

Likewise, any web page is published as HTML and so can be processed to extract any relevant information contained therein.

An example of HTML can be found here: http://kinetics.nist.gov/janaf/html/Cl-054.html

This the output of a search in the NIST-JANAF Thermodynamical Tables online database. The full HTML code can be read looking at the source of the page. A fragment is reproduced in Figure 1.

Figure 1. Part of the HTML code of a page

More information on HTML can be found at http://www.w3schools.com/html/.

XML (eXtensible Markup Language)

XML is a markup language designed to contain structured data in format easy to be read by computers and for humans. An example of an XML response can be found running the following query: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1/property/MolecularFormula,MolecularWeight,InChIKey/XML. Figure 2 shows its response.

Figure 2. An XML response

More information on XML can be found at http://www.w3schools.com/xml/.

CSV (Comma Separated Values)

CSV, for Comma Separated Values, is a plain-text file format use to store tabular data. In practice, CSV is not a unique format but different variations depending on the implementation and the regional setup of the operating system.

CSV main features:

Data is stored as plain text.
Records are separated by new-line characters.
Each record contains several fields. Columns (fields) are separated by a delimiting character (usually a comma, a semicolon or a tab).
Each record must have the same number of fields.

CSV main variations:

Any encoding may be used for the text file: ASCII, Latin-1, UTF-8...
Values may be quoted or not.
End-of-line character, delimiting character, quotes and escaping sequences may vary.
There may be header lines.

An example of a response in CSV is obtained when running the following query http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1000,1001/assaysummary/CSV.

The response is shown in Figure 3.

Figure 3. Fragment of a CSV response

More information can be obtained at https://en.wikipedia.org/wiki/Comma-separated_values.

JSON (JavaScript Object Notation)

JSON, for JavaScript Object Notation, is a format that encodes data object as human-readable text, organized as pairs attribute-value (name:value). Its main features are being a quite compact notation with easy conversion to Javascript objects (eval(), JSON.parse()).

An example query is https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/property/MolecularWeight/JSON. The response is shown in Figure 4.

Figure 4. A JSON response

More information can be found at http://www.json.org/.

Processing the responses... - Parsers and string functions

Ultimately the response will need to be processed to extract the desired information. This can be done using specific functions or parsers or through the use of some simple string manipulation functions.

Sometimes you want to parse XML or HTML using custom string functions for deleting whitespace, finding the position of certain characters within strings and for extracting substrings. These functions (plus many more) appear in all commonly used languages.

Table 1 summarizes some of these functions in the programming environments we will be discussing next week.

Table 1. Main string functions

Spreadsheet (Excel, Calc)	Basic (VBA, Libre Basic)	Javascript
SEARCH, FIND	InStr	search, indexOf
MID	Mid	substr, substring
SUBSTITUTE	Replace	replace
TRIM	Trim	trim
CONCATENATE	& (operator),	concat

Rating:

No votes yet

Join the conversation.

Comments 23

Need Help!

Hi All,

I think this is mostly directed at Jordi, but thought I'd post it here, so the discussion is public. The jist of this email is that I am pulling data from one table, but would like to pull from more than one.

I have been working on cleaning up a sheet for webscraping WikiPedia Chemboxes, and the following link should be public.

https://docs.google.com/spreadsheets/d/11iWH9L8T7Kyue3UYg4ZUiYZ47LlBeNZPWGfIlEFxyrk/edit#gid=1454316265

If you click cell [C1] a drop down box created by data validation allows you to choose a property from the Wikipedia Chembox, which is then uploaded through the following query
=QUERY(IMPORTHTML(B2,"table",1),"select Col2 where Col1='"&$D$1&"'",0)

Note, as I had to use a filter in [D1] to pull the value out of the validation box in [C1], and essentially said that if the dropdown box says "density", set D1 to "Density", and then Density was used in the query.

the IMPORTHTML command takes data from the first table, and for the compounds in red, the chembox is the second table (the first is stuff from the wikipedia editors.)

How can I get the above script to pull from the first or second table?

In can use this and it works by itself, but not in the query.
={IMPORTHTML(URL, "TABLE", 7) ; IMPORTHTML(URL,"TABLE",8) ; IMPORTHTML(URL,"TABLE",9) }

I can bring in two tables, but I can't get that to work in the query.

I did go in and change the index for glucose in [C20] from 1 to 2, and now it works, but that is only a short term fix, and what is really needed is the ability to pick from the first or second table, as we have no idea when Wikipedia editors will be placing their message/comment tables on pages.

Any help is appreciated, and I was thinking this might morph into a project. For example, I think I have a strategy that can pull from multiple infoboxes

Google Sheet issues

Bob,
This sounds really cool! Unfortunately the behavior of the Google Sheet is not the same as you've described. I don't see a dropdown box in cell C1. Maybe that's because I don't have full permission to edit the page. The URL you posted says edit in there, but the permission I see at the top of the page says I only have permission to view.
Jennifer

updated permissions

OK, I think you can now edit. What I suggest/request, is that anyone who wants to edit make a duplicate of the sheet to that (tabs at the bottom), and play around, and don't worry about messing anything up.

I had an old version that used 2 columns for every chemical, which was much less elegant, but could take data from multiple tables in a page.

Cheers,
Bob

My current approach...

is here: https://docs.google.com/spreadsheets/d/15WjPrkY16HC_XgAuQWJ78F0-l_bj7DP1-H9d6d7MG8o/copy

I would say it is far from good but it does solve some issues: it looks for the infobox and it is a bit more resilient to the different names used for the properties in different compound pages.

Hope it helps.
Best,

Jordi

A slightly better option...

could be:
https://docs.google.com/spreadsheets/d/1Qw5UVxDSBiGKjxACs-PaUVjkZlZCCH1IM7brcfo9xS8/copy

I don't think I will be able to do more than this without scripting.
Best,

Jordi

A couple of Questions

Hi All, I think I am addressing Jordi, although anyone is welcome to chime in.

I tweaked your spreadsheet smiles2spectra to name2spectra and it is here,
https://docs.google.com/spreadsheets/d/1EGfeOYLYWT-wM_cHMN8zxDT1WoltOnDSjWRfFg0B35g/copy

I have two questions,
First, in this code,
=MID(IMPORTXML("https://chem.nlm.nih.gov/chemidplus/name/" & B1,"//link[@rel='canonical']/@href"),40,60)
I do not understand what is happening after the "&B1&", and I want to make a video explaining what we are doing. Can you explain?

Second, in cell 24 I made a jab at getting the jdx. Any advice? I would like to download the spectral data as two columns, can we do that?
Cheers,
Bob

additional question

Maybe I should mention to everyone, if you type a name into cell B1, it brings up an IR/UVVis/MS spectra (dictated by cell A5), and a jdx of the data.

Is there anyway we can lock some cells on a spreadsheet, while leaving others open? The ideal would be to lock all cells but B1. Can that be done?
Cheers,
Bob

First: XPath

For those wondering the SMILES2spectra spreadsheet is here https://docs.google.com/spreadsheets/d/1iKEXWUk4JQQdueklu8-PFbg-b0GNHYC9hIvoywEOY4I/copy.

The expression that is the second argument of the IMPORTXML function is an XPath expression. XPath is a computer language to pull specific data from an XML expression. More information on XPath can be found at:
https://www.w3.org/TR/xpath/
https://www.w3schools.com/xml/xpath_intro.asp
https://www.tutorialspoint.com/xpath/index.htm

"//link[@rel='canonical']/@href" reads search for a "link" tag wherever in the page such as it has an attribute "rel" which equals "canonical" and pull the value of its "href" attribute.
For benzene, it will pull the value "https://chem.nlm.nih.gov/chemidplus/rn/71-43-2" from line 106 of view-source:https://chem.nlm.nih.gov/chemidplus/name/benzene.

Does it make sense?
Cheers,

Jordi

Third: Protecting range and sheets in Google Sheets

This canbe done using the Data/Protected sheets and range...
I would suggest protecting the sheet Sheet1 but the ranges B1:D1 and B2, by using a warning on edit.
Cheers,
Jordi

Second: Import JCAMP-DX from NIST

Take a look to an improved version of the SMILES2Spectra sheet here:
https://docs.google.com/spreadsheets/d/1DG9fTdSkDE_f2DqEIW0ib2HOlW6p-BQH0TMqZbJUYBE/copy

JCAMP-DX import is in Sheet3. The chart works only when the data is in XYDATA format (usually IR spectra). More work needs to be done to process peaks or points.

MS Excel function equivalent to image() in googlesheet?

Hi, Jordi,

Does MS Excel have a function equivalent to image() used in googlesheet?

Sunghwan,

Not yet, but...

it can be emulated from a VBA function.
Take a look here: https://drive.google.com/file/d/0B5ln9gFZnHD4a0RtWWJCdWpKclE/view

Does it help?
Cheers,
Jordi

question on XPath

How would one load this page to a google sheet? https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/XML/?heading=Boiling+Point&response_type=display

I am assuming you place this in a cell (D52) https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/XML/?heading=Boiling+Point

And then something like =IMPORTXML(D52"//XPath_query")
But I do not know the XPath_query, or how to understand the Xpath_query

The information at this URL

The information at this URL seems to be longer than what IMPORTXML can proccess. It won't be an XPath issue. For more info on XPath looks at the following links (repeated from a previous post): <a href="https://www.w3.org/TR/xpath/">https://www.w3.org/TR/xpath/</a> <a href="https://www.w3schools.com/xml/xpath_intro.asp">https://www.w3schools.com/xml/xpath_intro.asp</a> <a href="https://www.tutorialspoint.com/xpath/index.htm">https://www.tutorialspoint.com/xpath/index.htm</a> Best. Jordi

so is there a way to upload

so is there a way to upload that to a Google Sheet? Cheers (and thanks)

I can't see any easy way to

I can't see any easy way to do it: the contents won't fit in a cell and when splitted apart it takes more cells than allowed in Goggle Sheets. Either some external processing or some coding will be required. Cheers, Jordi

Here you have it

https://docs.google.com/spreadsheets/d/1PdBXwZFa_epfjMq4ZrFLiDxtIzJYtWtZtXew_Uc2XBY/copy

Question on Excel

I have a list of 100,002 InChI keys, and I am trying to see if those chemicals are on Wikipedia. A student created the following spreadsheet in Google, <a href="https://docs.google.com/spreadsheets/d/1Kit4f7frOnz-2SWSM6KV9ULZjffm8QcfQJx09GWSz3M/copy">https://docs.google.com/spreadsheets/d/1Kit4f7frOnz-2SWSM6KV9ULZjffm8QcfQJx09GWSz3M/copy</a> I can get column "B" to work, but am dumbfounded on column "C". There may be a different way to do this, but our objective is to take a list of InChI Keys and determine how many are in Wikipedia, and then, how many of those have GHS codes in the chembox. Any ideas?

another way

I think there is another way. When you go to the URL with the InChI key, you get the page asking if you want to create the page, as wikipedia pages are not named after InChI keys. What we need, is the name of the wikipedia page, and that is embedded in the script here, <a href="/wiki/Maneb" title="Maneb" data-serp-pos="0">Maneb</a> I think the xpath stuff Jordi was showing us can do this, we just need to figure how to do it with Excel, instead of Google sheets.

Simplified Google Sheets version

A simplified version following the approach Bob just suggested is available at https://docs.google.com/spreadsheets/d/1CJwl2Uz5ZQDsy0V7I0khi5IJM2mhMdMbLycaIrnIJyU/copy I also changed the URL to search only for the connectivity layers of the InChI. This way more chemicals are found although stereoisomerical and isotopical information is not preserved. Cheers, Jordi

Couple of Questions

Hi Jordi, Thanks for sharing this. I have two questions, and please pardon my not having time to look up everything ab initio, but I did spend multiple hours on this today. First, on the code, =iferror(importXML(B2, "//a[@data-serp-pos='0']"),"InChi Key Not in Wikipedia") can you explain what is going on, especially the [@data-serp-pos='0] part. And second, we need to do this with Excel, as Google Docs said there was too much data. I will try and attach to this email the file I am working on, which I downloaded from PubChem, and contains chemicals with LCSS information. I am also curious, if you look at the raw text in the file, each InChI key ends with "N", and is followed by the PubChem number, as if it was one string. But when you open it as tab delineated, in Excel, it recognizes that there is a tab there, but why? Does anyone know? (Hopefully there will be a file attached to this comment) Cheers, Bob

Comment File:

2837755192517092584.txt

For the xpath expression

"[@data-serp-pos='0']" means that the XML element has an attribute named data-serp-pos which value is '0'. I don't know what the attribute means but it seems to be related to the position of the link in the search results. Thus a value of '0' would mean the first result. Cheers, Jordi

For the text file format

What you have between the InChIKey and the next CID is a line-feed character. This character is not rendered in some text editors (e.g. Windows Notepad) but it is in others (e.g WordPad). In Unix/Linux, it means the start of a new line and this is the way it is interpreted in Excel. In Windows, a newline is indicated with two characters, carriage-return + line-feed and this is way it doesn't show in notepad. More detailed information can be found at https://en.wikipedia.org/wiki/Newline. Cheers, Jordi

Programmatic Access to Data

Learning Objectives

Introduction to web APIs

What’s a web API?

How is a web API used?

Sending the query... - The HTTP protocol and the REST architecture

The HTTP request message

The HTTP request methods: GET and POST

The HTTP response message

Querying the databases

Further reading

Understanding the responses... - Common structured data filetypes

HTML (HyperText Markup Language)

XML (eXtensible Markup Language)

CSV (Comma Separated Values)

JSON (JavaScript Object Notation)

Processing the responses... - Parsers and string functions

Comments 23

Annotations