One common way to offer data to be used programmatically is through a web API. This learning object will be devoted to discuss how you access and use a web API.
An API (Application Programming Interface) is the set of elements that a programming library or service makes available to other programmers to be used remotely to access their services and data.
This API is called a web API when these functionalities are delivered via the HTTP protocol through the Internet.
In order to be able to get information from an online database, we will need to:
1.Send a query to the database through the Internet (HTTP protocol)
2.Understand the format in which the response will be received
3.Process this response to extract the desired information
HTTP stands for HyperText Transfer Protocol and it comprises the set of rules that control the transfer of a resource between a web server and a web client (usually a browser).
The basic elements of an HTTP transaction are:
- A request message
- A response message (which is usually the resource being transferred)
The request message consists in some text data sent to an URL (Uniform Resource Locator). An URL is a way of specifying an address in the Internet.
The general structure for an URL consists of
although in HTTP this commonly simplified to
http://domain/path or https://domain/path
The request message, which is usually handled by the HTTP client, has the following structure
- A first line which includes method, path and HTTP protocol version, usually
METHOD /path HTTP/x.x
- Several header lines
- A blank line
- And an optional message body
Although other methods exist, the most common request methods are GET and POST.
In the GET method, any additional information required to specify the request is included in the URL (which means in the path included in the first line of the message). The parameters are included after the path and a question mark as pairs name=value. Parameters are separated by the ampersand symbol (&)
In some cases, as we will see in PubChem API, the parameters are included in the path. This technique is called URL rewriting and shortens the URLs and makes it more usable.
In POST methods, any additional information is not included in the URL but in the body of the message and is usually submitted via a web form. We won’t go further into this method since most of the chemistry APIs can be queried via GET calls.
To end this brief introduction to HTTP, note that some characters are either disallowed or have special meaning in URLs. When these characters appear they will need to be encoded; for example a space is encoded as %20, and a sharp (#) is %23. A reference can be found at http://www.w3schools.com/tags/ref_urlencode.asp.
As we have seen for the request, the response contains:
- A first line
- Several header lines
- A blank line
- And a message body which is the transferred resource
Two aspects of the response to be highlighted are the status code (included in the first line) and the content-type information (which is one of the headers).
HTTP status codes are indicated as a three-digit number
- 200 means transaction completed successfully
- 3xx means some type of redirection
- 4xx means a client-side error; e.g. 404 means resource not found
- 5xx means a server-side error; e.g. 500 means unexpected server error
The content type parameter indicates what data is included in the message body. Each type of resource is indicated with a standardized string (Internet media type; aka MIME type). For example, some common response formats are
- HTML file: text/html
- XML file: application/xml or text/xml
- CSV file: text/csv
- JSON file: application/json
- PNG image: image/png
We have so far seen how the information is transmitted over the WWW. But how are databases accessed over the WWW? There are two main architectures used to access databases over the Net: SOAP and REST. Given its prevalence, we will center our discussion on the REST architecture.
Some major characteristics of the REST (REspresentational State Transfer) architecture are:
- It is usually implemented over HTTP.
- The queries are implemented as GET requests.
- The communication is stateless. Any query must contain all required information. Nothing is saved in the server.
- The query result is obtained as a response in a predefined format: XML, JSON, CSV, HTML, plain text...
Common formats in which a web API may provide its response are HTML, XML, CSV, JSON and plain text. A short explanation of the first four types follows next. Feel free to skip it if you already know their characteristics.
HTML is the standard language used to create web pages. When a web API outputs its result in HTML, this resource is usually intended to be viewed a browser. When no better option is available, HTML can be processed (parsed) to extract a specific information.
Likewise, any web page is published as HTML and so can be processed to extract any relevant information contained therein.
An example of HTML can be found here: http://kinetics.nist.gov/janaf/html/Cl-054.html
This the output of a search in the NIST-JANAF Thermodynamical Tables online database. The full HTML code can be read looking at the source of the page. A fragment is reproduced in Figure 1.
Figure 1. Part of the HTML code of a page
More information on HTML can be found at http://www.w3schools.com/html/.
XML is a markup language designed to contain structured data in format easy to be read by computers and for humans. An example of an XML response can be found running the following query: https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1/property/MolecularFormula,MolecularWeight,InChIKey/XML. Figure 2 shows its response.
Figure 2. An XML response
More information on XML can be found at http://www.w3schools.com/xml/.
CSV, for Comma Separated Values, is a plain-text file format use to store tabular data. In practice, CSV is not a unique format but different variations depending on the implementation and the regional setup of the operating system.
CSV main features:
- Data is stored as plain text.
- Records are separated by new-line characters.
- Each record contains several fields. Columns (fields) are separated by a delimiting character (usually a comma, a semicolon or a tab).
- Each record must have the same number of fields.
CSV main variations:
- Any encoding may be used for the text file: ASCII, Latin-1, UTF-8...
- Values may be quoted or not.
- End-of-line character, delimiting character, quotes and escaping sequences may vary.
- There may be header lines.
An example of a response in CSV is obtained when running the following query http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1000,1001/assaysummary/CSV.
The response is shown in Figure 3.
Figure 3. Fragment of a CSV response
More information can be obtained at https://en.wikipedia.org/wiki/Comma-separated_values.
An example query is https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/property/MolecularWeight/JSON. The response is shown in Figure 4.
Figure 4. A JSON response
More information can be found at http://www.json.org/.
Ultimately the response will need to be processed to extract the desired information. This can be done using specific functions or parsers or through the use of some simple string manipulation functions.
Sometimes you want to parse XML or HTML using custom string functions for deleting whitespace, finding the position of certain characters within strings and for extracting substrings. These functions (plus many more) appear in all commonly used languages.
Table 1 summarizes some of these functions in the programming environments we will be discussing next week.
Table 1. Main string functions
Spreadsheet (Excel, Calc)
Basic (VBA, Libre Basic)