Programmatic Access to Data

Jordi Cuadros, (jordi.cuadros@iqs.url.edu)

Learning Objectives

  • Get a basic understanding of Web APIs are accessed.

 

 

Introduction to web APIs

One common way to offer data to be used programmatically is through a web API. This learning object will be devoted to discuss how you access and use a web API.

 

What’s a web API?

An API (Application Programming Interface) is the set of elements that a programming library or service makes available to other programmers to be used remotely to access their services and data.

This API is called a web API when these functionalities are delivered via the HTTP protocol through the Internet.

Many web APIs exist, listings of public APIs can be found in https://www.publicapis.com/https://any-api.com/ or https://www.programmableweb.com/.

 

How is a web API used?

In order to be able to get information from an online database, we will need to:

1.Send a query to the database through the Internet (HTTP protocol)

2.Understand the format in which the response will be received

3.Process this response to extract the desired information

 

Sending the query... - The HTTP protocol and the REST architecture

HTTP stands for HyperText Transfer Protocol and it comprises the set of rules that control the transfer of a resource between a web server and a web client (usually a browser).

The basic elements of an HTTP transaction are:

  • A request message
  • A response message (which is usually the resource being transferred)

 

The HTTP request message

The request message consists in some text data sent to an URL (Uniform Resource Locator). An URL is a way of specifying an address in the Internet.

The general structure for an URL consists of

scheme://[user:password@]domain:port/path?query_string#fragment_id

although in HTTP this commonly simplified to

http://domain/path  or  https://domain/path

 

The request message, which is usually handled by the HTTP client, has the following structure

  • A first line which includes method, path and HTTP protocol version, usually

METHOD /path HTTP/x.x

  • Several header lines
  • A blank line
  • And an optional message body

 

The HTTP request methods: GET and POST

Although other methods exist, the most common request methods are GET and POST.

 

In the GET method, any additional information required to specify the request is included in the URL (which means in the path included in the first line of the message). The parameters are included after the path and a question mark as pairs name=value. Parameters are separated by the ampersand symbol (&)

/path?name1=value1&name2=value2

In some cases, as we will see in PubChem API, the parameters are included in the path. This technique is called URL rewriting and shortens the URLs and makes it more usable.

/path/name1/value1/name2/value2

or even

/path/value1/value2

 

In POST methods, any additional information is not included in the URL but in the body of the message and is usually submitted via a web form. We won’t go further into this method since most of the chemistry APIs can be queried via GET calls.

To end this brief introduction to HTTP, note that some characters are either disallowed or have special meaning in URLs. When these characters appear they will need to be encoded; for example a space is encoded as %20, and a sharp (#) is %23. A reference can be found at http://www.w3schools.com/tags/ref_urlencode.asp.

 

The HTTP response message

As we have seen  for the request, the response contains:

  • A first line
  • Several header lines
  • A blank line
  • And a message body which is the transferred resource

 

Two aspects of the response to be highlighted are the status code (included in the first line) and the content-type information (which is one of the headers).

HTTP status codes are indicated as a three-digit number

  • 200 means transaction completed successfully
  • 3xx means some type of redirection
  • 4xx means a client-side error; e.g. 404 means resource not found
  • 5xx means a server-side error; e.g. 500 means unexpected server error

 

The content type parameter indicates what data is included in the message body. Each type of resource is indicated with a standardized string (Internet media type; aka MIME type). For example, some common response formats are

  • HTML file: text/html
  • XML file: application/xml or text/xml
  • CSV file: text/csv
  • JSON file: application/json
  • PNG image: image/png

 

Querying the databases

We have so far seen how the information is transmitted over the WWW. But how are databases accessed over the WWW? There are two main architectures used to access databases over the Net: SOAP and REST. Given its prevalence, we will center our discussion on the REST architecture.

Some major characteristics of the REST (REspresentational State Transfer) architecture are:

  • It is usually implemented over HTTP.
  • The queries are implemented as GET requests.
  • The communication is stateless. Any query must contain all required information. Nothing is saved in the server.
  • The query result is obtained as a response in a predefined format: XML, JSON, CSV, HTML, plain text...

 

Further reading

 

Understanding the responses... - Common structured data filetypes

Common formats in which a web API may provide its response are HTML, XML, CSV,  JSON and plain text. A short explanation of the first four types follows next. Feel free to skip it if you already know their characteristics.

 

HTML (HyperText Markup Language)

HTML is the standard language used to create web pages. When a web API outputs its result in HTML, this resource is usually intended to be viewed a browser. When no better option is available, HTML can be processed (parsed) to extract a specific information.

Likewise, any web page is published as HTML and so can be processed to extract any relevant information contained therein.

An example of HTML can be found here: http://kinetics.nist.gov/janaf/html/Cl-054.html

This the output of a search in the NIST-JANAF Thermodynamical Tables online database. The full HTML code can be read looking at the source of the page. A fragment is reproduced in Figure 1.

 

output of NIST-JANAF thermodynamic online database table 

Figure 1. Part of the HTML code of a page

 

More information on HTML can be found at http://www.w3schools.com/html/.

 

XML (eXtensible Markup Language)

XML is a markup language designed to contain structured data in format easy to be read by computers and for humans. An example of an XML response can be found running the following query: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1/property/MolecularFormula,MolecularWeight,InChIKey/XML. Figure 2 shows its response.

 

XML Response

Figure 2. An XML response

 

More information on XML can be found at http://www.w3schools.com/xml/.

 

CSV (Comma Separated Values)

CSV, for Comma Separated Values, is a plain-text file format use to store tabular data. In practice, CSV is not a unique format but different variations depending on the implementation and the regional setup of the operating system.

CSV main features:

  • Data is stored as plain text.
  • Records are separated by new-line characters.
  • Each record contains several fields. Columns (fields) are separated by a delimiting character (usually a comma, a semicolon or a tab).
  • Each record must have the same number of fields.

CSV main variations:

  • Any encoding may be used for the text file: ASCII, Latin-1, UTF-8...
  • Values may be quoted or not.
  • End-of-line character, delimiting character, quotes and escaping sequences may vary.
  • There may be header lines.

 

An example of a response in CSV is obtained when running the following query http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1000,1001/assaysummary/CSV.

The response is shown in Figure 3.

 

Fragment of CSV response

Figure 3. Fragment of a CSV response

 

More information can be obtained at https://en.wikipedia.org/wiki/Comma-separated_values.

 

JSON (JavaScript Object Notation)

JSON, for JavaScript Object Notation, is a format that encodes data object as human-readable text, organized as pairs attribute-value (name:value). Its main features are being a quite compact notation with easy conversion to Javascript objects (eval(), JSON.parse()).

An example query is https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/property/MolecularWeight/JSON. The response is shown in Figure 4.

 

Figure 4. A JSON response

 

More information can be found at http://www.json.org/.

 

Processing the responses... - Parsers and string functions

Ultimately the response will need to be processed to extract the desired information. This can be done using specific functions or parsers or through the use of some simple string manipulation functions.

Sometimes you want to parse XML or HTML using custom string functions for deleting whitespace, finding the position of certain characters within strings and for extracting substrings. These functions (plus many more) appear in all commonly used languages.

Table 1 summarizes some of these functions in the programming environments we will be discussing next week.

 

 

Table 1. Main string functions

Spreadsheet (Excel, Calc)

Basic (VBA, Libre Basic)

Javascript

SEARCH, FIND

InStr

search, indexOf

MID

Mid

substr, substring

SUBSTITUTE

Replace

replace

TRIM

Trim

trim

CONCATENATE

& (operator),

concat

 

Rating: 
0
No votes yet
Join the conversation.

Comments 23

Robert Belford's picture
Robert Belford | Wed, 02/22/2017 - 13:26

Hi All,

I think this is mostly directed at Jordi, but thought I'd post it here, so the discussion is public. The jist of this email is that I am pulling data from one table, but would like to pull from more than one.

I have been working on cleaning up a sheet for webscraping WikiPedia Chemboxes, and the following link should be public.

https://docs.google.com/spreadsheets/d/11iWH9L8T7Kyue3UYg4ZUiYZ47LlBeNZPWGfIlEFxyrk/edit#gid=1454316265

If you click cell [C1] a drop down box created by data validation allows you to choose a property from the Wikipedia Chembox, which is then uploaded through the following query
=QUERY(IMPORTHTML(B2,"table",1),"select Col2 where Col1='"&$D$1&"'",0)

Note, as I had to use a filter in [D1] to pull the value out of the validation box in [C1], and essentially said that if the dropdown box says "density", set D1 to "Density", and then Density was used in the query.

the IMPORTHTML command takes data from the first table, and for the compounds in red, the chembox is the second table (the first is stuff from the wikipedia editors.)

How can I get the above script to pull from the first or second table?

In can use this and it works by itself, but not in the query.
={IMPORTHTML(URL, "TABLE", 7) ; IMPORTHTML(URL,"TABLE",8) ; IMPORTHTML(URL,"TABLE",9) }

I can bring in two tables, but I can't get that to work in the query.

I did go in and change the index for glucose in [C20] from 1 to 2, and now it works, but that is only a short term fix, and what is really needed is the ability to pick from the first or second table, as we have no idea when Wikipedia editors will be placing their message/comment tables on pages.

Any help is appreciated, and I was thinking this might morph into a project. For example, I think I have a strategy that can pull from multiple infoboxes

Jennifer Muzyka | Wed, 02/22/2017 - 13:31

Bob,
This sounds really cool! Unfortunately the behavior of the Google Sheet is not the same as you've described. I don't see a dropdown box in cell C1. Maybe that's because I don't have full permission to edit the page. The URL you posted says edit in there, but the permission I see at the top of the page says I only have permission to view.
Jennifer

Robert Belford's picture
Robert Belford | Wed, 02/22/2017 - 13:48

OK, I think you can now edit. What I suggest/request, is that anyone who wants to edit make a duplicate of the sheet to that (tabs at the bottom), and play around, and don't worry about messing anything up.

I had an old version that used 2 columns for every chemical, which was much less elegant, but could take data from multiple tables in a page.

Cheers,
Bob

Cheers,
Bob

Jordi Cuadros's picture
Jordi Cuadros | Thu, 02/23/2017 - 04:19

is here: https://docs.google.com/spreadsheets/d/15WjPrkY16HC_XgAuQWJ78F0-l_bj7DP1-H9d6d7MG8o/copy

I would say it is far from good but it does solve some issues: it looks for the infobox and it is a bit more resilient to the different names used for the properties in different compound pages.

Hope it helps.
Best,

Jordi

Jordi Cuadros's picture
Jordi Cuadros | Thu, 02/23/2017 - 04:21

could be:
https://docs.google.com/spreadsheets/d/1Qw5UVxDSBiGKjxACs-PaUVjkZlZCCH1IM7brcfo9xS8/copy

I don't think I will be able to do more than this without scripting.
Best,

Jordi

Robert Belford's picture
Robert Belford | Fri, 02/24/2017 - 19:31

Hi All, I think I am addressing Jordi, although anyone is welcome to chime in.

I tweaked your spreadsheet smiles2spectra to name2spectra and it is here,
https://docs.google.com/spreadsheets/d/1EGfeOYLYWT-wM_cHMN8zxDT1WoltOnDSjWRfFg0B35g/copy

I have two questions,
First, in this code,
=MID(IMPORTXML("https://chem.nlm.nih.gov/chemidplus/name/" & B1,"//link[@rel='canonical']/@href"),40,60)
I do not understand what is happening after the "&B1&", and I want to make a video explaining what we are doing. Can you explain?

Second, in cell 24 I made a jab at getting the jdx. Any advice? I would like to download the spectral data as two columns, can we do that?
Cheers,
Bob

Robert Belford's picture
Robert Belford | Fri, 02/24/2017 - 19:36

Maybe I should mention to everyone, if you type a name into cell B1, it brings up an IR/UVVis/MS spectra (dictated by cell A5), and a jdx of the data.

Is there anyway we can lock some cells on a spreadsheet, while leaving others open? The ideal would be to lock all cells but B1. Can that be done?
Cheers,
Bob

Jordi Cuadros's picture
Jordi Cuadros | Sat, 02/25/2017 - 01:19

For those wondering the SMILES2spectra spreadsheet is here https://docs.google.com/spreadsheets/d/1iKEXWUk4JQQdueklu8-PFbg-b0GNHYC9hIvoywEOY4I/copy.

The expression that is the second argument of the IMPORTXML function is an XPath expression. XPath is a computer language to pull specific data from an XML expression. More information on XPath can be found at:
https://www.w3.org/TR/xpath/
https://www.w3schools.com/xml/xpath_intro.asp
https://www.tutorialspoint.com/xpath/index.htm

"//link[@rel='canonical']/@href" reads search for a "link" tag wherever in the page such as it has an attribute "rel" which equals "canonical" and pull the value of its "href" attribute.
For benzene, it will pull the value "https://chem.nlm.nih.gov/chemidplus/rn/71-43-2" from line 106 of view-source:https://chem.nlm.nih.gov/chemidplus/name/benzene.

Does it make sense?
Cheers,

Jordi

Jordi Cuadros's picture
Jordi Cuadros | Sat, 02/25/2017 - 02:27

This canbe done using the Data/Protected sheets and range...
I would suggest protecting the sheet Sheet1 but the ranges B1:D1 and B2, by using a warning on edit.
Cheers,
Jordi

Jordi Cuadros's picture
Jordi Cuadros | Sat, 02/25/2017 - 02:25

Take a look to an improved version of the SMILES2Spectra sheet here:
https://docs.google.com/spreadsheets/d/1DG9fTdSkDE_f2DqEIW0ib2HOlW6p-BQH0TMqZbJUYBE/copy

JCAMP-DX import is in Sheet3. The chart works only when the data is in XYDATA format (usually IR spectra). More work needs to be done to process peaks or points.

Jordi Cuadros's picture
Jordi Cuadros | Sat, 02/25/2017 - 07:31

it can be emulated from a VBA function.
Take a look here: https://drive.google.com/file/d/0B5ln9gFZnHD4a0RtWWJCdWpKclE/view

Does it help?
Cheers,
Jordi

Robert Belford's picture
Robert Belford | Tue, 03/14/2017 - 15:56

How would one load this page to a google sheet? https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/XML/?heading=Boiling+Point&response_type=display

I am assuming you place this in a cell (D52) https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/XML/?heading=Boiling+Point

And then something like =IMPORTXML(D52"//XPath_query")
But I do not know the XPath_query, or how to understand the Xpath_query

Jordi Cuadros's picture
Jordi Cuadros | Tue, 03/14/2017 - 16:24
The information at this URL seems to be longer than what IMPORTXML can proccess. It won't be an XPath issue. For more info on XPath looks at the following links (repeated from a previous post): <a href="https://www.w3.org/TR/xpath/">https://www.w3.org/TR/xpath/</a> <a href="https://www.w3schools.com/xml/xpath_intro.asp">https://www.w3schools.com/xml/xpath_intro.asp</a> <a href="https://www.tutorialspoint.com/xpath/index.htm">https://www.tutorialspoint.com/xpath/index.htm</a> Best. Jordi

Robert Belford's picture
Robert Belford | Tue, 03/14/2017 - 16:33
so is there a way to upload that to a Google Sheet? Cheers (and thanks)

Jordi Cuadros's picture
Jordi Cuadros | Tue, 03/14/2017 - 16:50
I can't see any easy way to do it: the contents won't fit in a cell and when splitted apart it takes more cells than allowed in Goggle Sheets. Either some external processing or some coding will be required. Cheers, Jordi

Robert Belford's picture
Robert Belford | Fri, 03/24/2017 - 17:22
I have a list of 100,002 InChI keys, and I am trying to see if those chemicals are on Wikipedia. A student created the following spreadsheet in Google, <a href="https://docs.google.com/spreadsheets/d/1Kit4f7frOnz-2SWSM6KV9ULZjffm8QcfQJx09GWSz3M/copy">https://docs.google.com/spreadsheets/d/1Kit4f7frOnz-2SWSM6KV9ULZjffm8QcfQJx09GWSz3M/copy</a> I can get column "B" to work, but am dumbfounded on column "C". There may be a different way to do this, but our objective is to take a list of InChI Keys and determine how many are in Wikipedia, and then, how many of those have GHS codes in the chembox. Any ideas?

Robert Belford's picture
Robert Belford | Fri, 03/24/2017 - 17:36
I think there is another way. When you go to the URL with the InChI key, you get the page asking if you want to create the page, as wikipedia pages are not named after InChI keys. What we need, is the name of the wikipedia page, and that is embedded in the script here, <a href="/wiki/Maneb" title="Maneb" data-serp-pos="0">Maneb</a> I think the xpath stuff Jordi was showing us can do this, we just need to figure how to do it with Excel, instead of Google sheets.

Robert Belford's picture
Robert Belford | Fri, 03/24/2017 - 21:54
Hi Jordi, Thanks for sharing this. I have two questions, and please pardon my not having time to look up everything ab initio, but I did spend multiple hours on this today. First, on the code, =iferror(importXML(B2, "//a[@data-serp-pos='0']"),"InChi Key Not in Wikipedia") can you explain what is going on, especially the [@data-serp-pos='0] part. And second, we need to do this with Excel, as Google Docs said there was too much data. I will try and attach to this email the file I am working on, which I downloaded from PubChem, and contains chemicals with LCSS information. I am also curious, if you look at the raw text in the file, each InChI key ends with "N", and is followed by the PubChem number, as if it was one string. But when you open it as tab delineated, in Excel, it recognizes that there is a tab there, but why? Does anyone know? (Hopefully there will be a file attached to this comment) Cheers, Bob
Comment File: 

Jordi Cuadros's picture
Jordi Cuadros | Sat, 03/25/2017 - 01:40

"[@data-serp-pos='0']" means that the XML element has an attribute named data-serp-pos which value is '0'. I don't know what the attribute means but it seems to be related to the position of the link in the search results. Thus a value of '0' would mean the first result. Cheers, Jordi

Jordi Cuadros's picture
Jordi Cuadros | Sat, 03/25/2017 - 01:52

What you have between the InChIKey and the next CID is a line-feed character. This character is not rendered in some text editors (e.g. Windows Notepad) but it is in others (e.g WordPad). In Unix/Linux, it means the start of a new line and this is the way it is interpreted in Excel. In Windows, a newline is indicated with two characters, carriage-return + line-feed and this is way it doesn't show in notepad. More detailed information can be found at https://en.wikipedia.org/wiki/Newline. Cheers, Jordi

Annotations