8.4 Other relevant sources of data: Other web APIs & web scraping

Many other data sources are available on the Net. Some of them have APIs to access them programmatically. Some of them can be found in the following directories:



Wikipedia API

In this module, we will only explore the Wikipedia API, both as an example and as a huge multilingual information source. As you know, Wikipedia is the largest encyclopedia on the Internet. It is free-to-access and is edited by its users.

In terms of processing, we must be aware that it contains semi-structured data, which makes it somewhat harder to work with.

Elements of Wikipedia of major interest for a chemist:

●Substance description
●Substance name translations
●Chemboxes (https://en.wikipedia.org/wiki/Template:Chembox)
●Data pages for specific substances: “Properties of” pages, like https://en.wikipedia.org/wiki/Properties_of_water and supplementary data pages: https://en.wikipedia.org/wiki/Aluminium_chloride_%28data_page%29
●Data tables, for example https://en.wikipedia.org/wiki/List_of_elements, https://en.wikipedia.org/wiki/Solubility_table or https://en.wikipedia.org/wiki/Dictionary_of_chemical_formulas


Wikipedia content can be accessed through the API of MediaWiki, the wiki engine behind Wikipedia and other projects. Its documentation is available at https://www.mediawiki.org/wiki/API:Main_page, https://www.mediawiki.org/wiki/API:Tutorial and https://en.wikipedia.org/w/api.php?action=help&recursivesubmodules=1. A more RESTful API is in development: https://www.mediawiki.org/api/rest_v1/?doc.

The API entry point for each localized Wikipedia is https://LANGUAGE_CODE.wikipedia.org/w/api.php. The language code for each Wikipedia can be looked up at https://www.wikipedia.org/.

The major actions (action=...) for extracting data from the MediaWiki API are:

  • query: to access the data of a page or to look something up in Wikipedia. The major parameters to set are prop=revisions, rvprop=content, titles=... and rvcontentformat=...
  • parse: to get the processed (rendered) version of a page. Here the major parameters are prop=text, page=... and contentformat=...
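As an illustration of the two actions above, the following minimal sketch builds the request URLs with the Python standard library. The helper names (api_endpoint, query_url, parse_url) are hypothetical, and format=json is an assumption chosen for easy processing; no request is actually sent.

```python
from urllib.parse import urlencode


def api_endpoint(language_code):
    """Return the MediaWiki API entry point of one localized Wikipedia."""
    return f"https://{language_code}.wikipedia.org/w/api.php"


def query_url(title, language_code="en", fmt="json"):
    """Build an action=query URL asking for the content of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": fmt,
    }
    # urlencode escapes spaces, parentheses and non-ASCII characters
    return api_endpoint(language_code) + "?" + urlencode(params)


def parse_url(page, language_code="en", fmt="json"):
    """Build an action=parse URL asking for the rendered text of a page."""
    params = {"action": "parse", "page": page, "prop": "text", "format": fmt}
    return api_endpoint(language_code) + "?" + urlencode(params)


print(query_url("Methane"))
print(parse_url("Ethane (data page)"))
```

The resulting strings can be pasted into a browser or passed to any HTTP client. Building the query string with urlencode rather than by hand avoids mistakes with titles that contain spaces or parentheses.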

In either case, the response will need to be processed to extract the desired information.
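A rough sketch of such processing is shown below. It assumes a JSON response to action=query in which the page text sits under query, pages, revisions and the "*" key; the sample response is hand-written and heavily abbreviated, not an actual API reply, and the function names are hypothetical. The field extraction is a simple line-based regular expression that only handles flat "| Name = value" lines.

```python
import json
import re

# Hand-written, abbreviated stand-in for an API response (an assumption,
# not real Wikipedia output).
SAMPLE_RESPONSE = json.dumps({
    "query": {"pages": {"0": {"title": "Methane", "revisions": [{
        "contentformat": "text/x-wiki",
        "*": "{{Chembox\n"
             "| Name = Methane\n"
             "| Formula = CH4\n"
             "| MolarMass = 16.04 g/mol\n"
             "}}\n"
             "'''Methane''' is a chemical compound.",
    }]}}}
})

# One template field per line: "| FieldName = value"
FIELD = re.compile(r"^\|[ \t]*(\w+)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def extract_wikitext(response_text):
    """Return the wikitext of the first page found in the response."""
    data = json.loads(response_text)
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]


def template_fields(wikitext):
    """Collect 'name = value' template fields into a dictionary."""
    return dict(FIELD.findall(wikitext))


fields = template_fields(extract_wikitext(SAMPLE_RESPONSE))
print(fields["Formula"], fields["MolarMass"])
```

Real chemboxes nest templates inside templates, so a line-based pattern like this one will miss or mangle many fields. A dedicated wikitext parser, for example the third-party mwparserfromhell package, is likely to be more robust.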

Some example calls are: https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=methane&rvprop=content&rvcontentformat=text/x-wiki&format=xml, https://en.wikipedia.org/w/api.php?action=parse&page=ethane%20(data%20page)&prop=text&format=txt.

Other MediaWiki scripts also accept parameters sent via the GET or POST methods. The most important one is index.php. More information can be found at https://en.wikipedia.org/wiki/Help:URL and https://www.mediawiki.org/wiki/Manual:Parameters_to_index.php.


Web scraping

Some information on the web can only be accessed as HTML, that is, the way a browser does. Even in these cases, the HTML file can be parsed to extract the relevant information. This is called web scraping or screen scraping.
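To make the idea concrete, here is a minimal sketch that extracts the cells of an HTML table using only the Python standard library. The HTML fragment is a made-up sample standing in for a downloaded page, and the TableParser class is a hypothetical helper; real pages are usually messier and may need a more forgiving parser.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <sub> also ends up here
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample page (an assumption, not taken from any real site)
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Solvent</th><th>Formula</th></tr>
  <tr><td>Water</td><td>H<sub>2</sub>O</td></tr>
  <tr><td>Ethanol</td><td>C<sub>2</sub>H<sub>6</sub>O</td></tr>
</table>
</body></html>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
for row in parser.rows:
    print(row)
```

In practice the HTML string would come from an HTTP request instead of a constant, and the extracted rows would still need cleaning (units, footnote markers, merged cells) before use.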

Making a limited number of requests can be considered similar to looking up the information with a browser, but this does not hold for more intensive use of a site. In the latter case, you must be aware of the permitted uses of the website and of the licenses on the information you are scraping.
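One common courtesy, sketched below under stated assumptions, is to honour the site's robots.txt file and to pause between requests. The robots.txt lines, the example.org URLs, the user agent string and the helper names are all made up for illustration. Passing a robots.txt check does not replace reading the site's terms of use and data licenses.

```python
import time
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content (an assumption for this sketch)
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]


def allowed_urls(robots_lines, user_agent, urls, default_delay=1.0):
    """Return the URLs robots.txt permits and the delay to use between them."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    delay = rules.crawl_delay(user_agent) or default_delay
    permitted = [url for url in urls if rules.can_fetch(user_agent, url)]
    return permitted, delay


def fetch_politely(urls, delay, user_agent):
    """Download each URL in turn, sleeping `delay` seconds between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)
        request = Request(url, headers={"User-Agent": user_agent})
        with urlopen(request, timeout=30) as response:
            pages[url] = response.read()
    return pages


candidates = [
    "https://example.org/data/page.html",
    "https://example.org/private/report.html",
]
permitted, delay = allowed_urls(SAMPLE_ROBOTS, "course-example-bot", candidates)
print(permitted, delay)
```

Only allowed_urls runs here; fetch_politely is defined but not called, so the sketch makes no network requests. For a real site, the robots.txt lines would be downloaded first, for instance with the set_url and read methods of RobotFileParser.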

Some example websites with chemical information that could be scraped are listed below:

●Kaye and Laby Online, http://www.kayelaby.npl.co.uk/
●Common Chemistry, http://commonchemistry.org
●MatWeb, http://www.matweb.com
●Solv-DB, http://solvdb.ncms.org/index.html
●eMolecules, https://www.emolecules.com/
●ChemSynthesis, http://www.chemsynthesis.com
●NIST Chemistry Webbook, http://webbook.nist.gov/chemistry/
●Other NIST free access databases, http://srdata.nist.gov/gateway/gateway?dblist=0
●Chemexper, http://www.chemexper.com
●Open Notebook Science, http://onswebservices.wikispaces.com/home
●DrugBank, http://www.drugbank.ca/
●ChemWiki, http://chemwiki.ucdavis.edu
●FDA Substance Registration System, http://fdasis.nlm.nih.gov/srs/srs.jsp
●EMD Millipore Catalog, http://www.emdmillipore.com
●Sigma Aldrich Catalog, http://www.sigmaaldrich.com/
●Alfa Aesar Catalog, https://www.alfa.com/es/catalog/category/chemicals/
●Iowa State University Chemistry Material Safety Data Sheets, http://avogadro.chem.iastate.edu/MSDS/homepage.html


Many more resources can be processed to obtain almost any information you may need, including Google (http://www.google.com). In fact, any website that is either static or accepts GET parameters can be used programmatically with the same procedures we will use for web APIs.
