Discussion

Ehren Bucholtz | Mon, 02/27/2017 - 14:24

Project Idea: Optical Structure Recognition

This is a project that I have been working on, and I thought I could incorporate students into it. Students in this course have varying levels of skills and abilities, so depending on skill level, students might take on different aspects of this project if interested.

The project is based on the OSRA program developed by Igor Filippov at NCI:

https://cactus.nci.nih.gov/osra/

Igor’s original goal was to create software that could read structures from published PDFs and extract the image data as SMILES strings.

You can test it out here:

https://cactus.nci.nih.gov/cgi-bin/osra/index.cgi

You just upload a structure that you have drawn in a chemical drawing program and saved as a GIF or JPG.

I have contacted Igor, and in addition to some of my ideas, he has some great ideas that would be perfect for student projects.

  1. Using OSRA for handwritten structures

I saw this as a way to remove the tedium of hand grading papers. I thought it would be great to have my 200 students answer a quiz question by drawing and submitting papers. I could have the software scan them and see if they are correct. There seems to be evidence that handwriting improves learning. Students also draw structures when they do homework, but for large classes we often test with multiple choice. Eventually, I would like to see if students have better learning outcomes if, for example, we stick to all drawing.

I have been testing this out with a subset of structures, and am finding out the issues with handwriting recognition within the program.

    a. Subproject 1: validation database

One of the key aspects for this part of the project would be to develop a validation set of molecules for testing.

The following is a link to a Google Sheet that has what I have in mind for this aspect of the project:

https://docs.google.com/spreadsheets/d/1Qrxa3vkWkyKGHGuAppm4lGe3UdTvz7g-OODtc8xu7Yc/edit?usp=sharing

The project would be to set up a list of alkenes, branched alkenes, cycloalkenes, alkynes, branched alkynes, alcohols, branched alcohols, alkyl halides (bromides, chlorides, fluorides), ketones, carboxylic acid derivatives (acid chlorides, esters, anhydrides, thioesters, amides), amines and thiols. You can see from the Google Sheet above that stereochemistry can and should be included. One of the aspects would be to use InChIKey data to determine whether a correct structure was drawn. For example, 4-methyloctane can be racemic, R or S, but the first 14 letters of the InChIKey don’t change. So, in the Python grading program that I have written, one can award partial credit for drawing correct connectivity but incorrect stereochemistry. I know that OSRA can handle wedges and dashes, but I am unsure how well it does so, especially for handwritten structures.
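A minimal sketch of that partial-credit idea, not the grading program itself: it assumes the expected and submitted InChIKeys are already available as strings. The function name and credit values are hypothetical, and the keys in the usage note are made-up placeholders in InChIKey layout, not real compounds.

```python
def grade_inchikey(expected_key: str, submitted_key: str,
                   full_credit: float = 1.0,
                   partial_credit: float = 0.5) -> float:
    """Score a submitted structure against the expected one by InChIKey.

    Hypothetical scoring rule (an assumption, not the author's program):
      - identical keys                  -> full credit
      - same first 14 characters only   -> partial credit
        (connectivity matches, stereochemistry or another layer differs)
      - otherwise                       -> no credit
    """
    expected = expected_key.strip().upper()
    submitted = submitted_key.strip().upper()

    if expected == submitted:
        return full_credit
    # The first 14 characters of an InChIKey encode the connectivity.
    if expected[:14] == submitted[:14]:
        return partial_credit
    return 0.0


# Placeholder keys for illustration only (NOT real InChIKeys).
KEY_R = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
KEY_S = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"
KEY_OTHER = "ZZZZZZZZZZZZZZ-BBBBBBBBSA-N"
```

With these placeholders, `grade_inchikey(KEY_R, KEY_S)` gives partial credit because only the second block differs, while `grade_inchikey(KEY_R, KEY_OTHER)` gives zero.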

    b. Subproject 2: handwriting recognition

I have written a program in Python that, after a student paper is scanned in, checks each structure and compares its SMILES string to a known SMILES string. I think using InChI might be better for this, as indicated in part 1a. This part of the project would be to draw the molecules in the validation set by hand, scan them in, and determine the error rate for SMILES determination. Factors that I have found complicate this step include eraser marks, quality of lines drawn, and sloppy drawing. In this part of the project we would look at what makes a good structure for analysis (whether hand drawn or computer drawn).
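One possible way to tabulate the error rate per complicating factor is sketched below. It assumes each scanned drawing has been labeled with a condition and that both SMILES strings were canonicalized by the same toolkit, since plain string equality is only meaningful in that case. The function name and the sample records are invented for illustration.

```python
from collections import defaultdict


def error_rates_by_condition(records):
    """Return {condition: fraction of structures recognized incorrectly}.

    `records` is an iterable of (condition, expected_smiles, recognized_smiles).
    Assumes both SMILES in each record are canonical strings from the same
    toolkit, so that string equality means "same structure".
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for condition, expected, recognized in records:
        totals[condition] += 1
        if recognized != expected:
            errors[condition] += 1
    return {condition: errors[condition] / totals[condition]
            for condition in totals}


# Invented sample data, only to show the shape of the input.
sample_records = [
    ("clean", "CCO", "CCO"),
    ("clean", "CC=C", "CC=C"),
    ("eraser marks", "CCO", "CCC"),   # misread heteroatom
    ("eraser marks", "CC#C", "CC#C"),
]
rates = error_rates_by_condition(sample_records)
```

Grouping by condition keeps the question "what makes a good structure for analysis" separate from the overall success rate.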

  2. Using OSRA for evaluation of bridging compounds

OSRA has an algorithm for dealing with bicyclic or bridged compounds. The question is how good it is at handling them. I know that morphine is in the original validation data set, but when I have tried determining its SMILES from an image, it seems to get it wrong. This part of the project would be to use PubChem or Reaxys to identify bicyclic/bridged molecules (such as adamantane) and determine the success rate.

  3. OSRA for reaction recognition

Just like compounds, reactions can be identified by SMILES (reaction SMILES, or rsmi). You can see an example and explanation of reaction SMILES here:

http://www.daylight.com/meetings/summerschool01/course/basics/smirks.html

In the last few versions of OSRA (1.4.0 and higher) this capability has been added, but when I spoke with Igor last week, he indicated that this aspect of the program has not been fully vetted.

Igor has suggested we put together a collection of a few hundred images from literature and/or patents. We would have to make a database of the images with starting material, reagents, and products. Then we would need to determine what the accepted rsmi should be for the reaction, and then compare the OSRA output to the known.
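The comparison step could look something like the sketch below. It assumes plain reaction SMILES of the form reactants>agents>products, with components separated by dots, and it compares component strings literally, so both sides would need to be canonicalized by the same toolkit first. The function names are hypothetical, and the "OSRA output" in the example is invented, not a real result.

```python
def parse_rsmi(rsmi: str):
    """Split 'reactants>agents>products' into three sorted component tuples."""
    parts = rsmi.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators in {rsmi!r}")
    # Sorting makes the comparison independent of the order components were
    # written in, while keeping duplicates (stoichiometry).
    return tuple(tuple(sorted(p.split("."))) if p else () for p in parts)


def compare_rsmi(expected: str, observed: str) -> dict:
    """Report, role by role, whether the observed reaction matches."""
    roles = ("reactants", "agents", "products")
    exp = parse_rsmi(expected)
    obs = parse_rsmi(observed)
    return {role: exp[i] == obs[i] for i, role in enumerate(roles)}


# Accepted rsmi for an esterification: acetic acid + ethanol, sulfuric acid
# as the agent, giving ethyl acetate + water.
known = "CC(=O)O.CCO>OS(=O)(=O)O>CC(=O)OCC.O"
# Invented recognizer output that found the molecules but missed the reagent.
recognized = "CCO.CC(=O)O>>CC(=O)OCC.O"
report = compare_rsmi(known, recognized)
```

Reporting per role shows whether a failure came from the starting materials, the reagents over the arrow, or the products, which is more informative than a single pass/fail.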

  4. OSRA for polymer recognition

Polymer support was added to OSRA in the latest version, 2.1.0. Igor indicated that this aspect of the program has not been fully vetted either. Just like in project 3 above, we would have to put together a database of polymers and test whether OSRA properly determines the output. He suggested that we look at the latest version of InChI, 1.05. I am guessing we could see what they used as a validation set and use that as a starting point for OSRA.

  5. Comparison of OSRA to IMAGO

OSRA is not the only open source optical structure recognition program available. Imago (https://github.com/ggasoftware/imago) is also free. There is also a commercial package called CLiDE (http://www.keymodule.co.uk/products/clide/index.html). It might be interesting to see how OSRA compares to these other packages.

Long story short, there are lots of projects that can be based around optical structure recognition. I also believe that some of these projects could result in publication if any students are looking to make posters for ACS meetings, or even a publication in J. Chem. Inf. Model. for validation of OSRA, or J. Chem. Educ. for the handwriting analysis.

Ehren Bucholtz | Mon, 02/27/2017 - 11:13

I have been working with a Google spreadsheet this morning, trying to access OPSIN and Cactus. There are a couple of things that I thought I would share.

1) It looks like the OPSIN implementation of SMILES is non-canonical, and appears to give the same result as Open Babel. Therefore, I created a new column that takes the OPSIN SMILES and uses it as a new input in Cactus at the chemical identifier web tool. That appears to give the Daylight or canonical SMILES.

2) With 4 columns doing web access of data (one each for OPSIN name to SMILES, OPSIN name to InChIKey, open structure image, and the canonical SMILES from OPSIN SMILES) I am getting lots of "loading" and "#NA". Google Sheets then gives me an error: "Error: Loading data may take a while because of the large number of requests. Try to reduce the amount of IMPORTHTML, IMPORTDATA, IMPORTFEED or IMPORTXML functions across spreadsheets you've created."

Is anyone else getting really slow response times, or the same error when importing multiple data columns? One of the issues that I think we will have in the future is that the spreadsheet rebuilds this data dynamically every time. Is there a way to have the Google Sheet store the data once collected so that it doesn't have to repopulate each time? As it stands now, if I close the sheet and come back in, it has to reload all the data. It is suggested in the above information that maybe we should copy and paste the data once collected into a new spreadsheet, a sort of two-database system where one is dynamic and one is static. I like how easy it is to do this with Google Sheets, but I wonder if there is a way to do this with Excel so that I can tell the spreadsheet to refresh only when I tell it to. Any ideas?
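For comparison, here is how the "collect once, then keep a static copy" idea might look outside a spreadsheet, as a small Python sketch. It is only an illustration under assumptions: `fetch` stands in for whatever function actually calls the web service (OPSIN, Cactus, etc.), and the function name and cache file are hypothetical.

```python
import json
from pathlib import Path


def cached_lookup(key, fetch, cache_path):
    """Return the stored value for `key`, calling `fetch(key)` only if missing.

    `fetch` is a caller-supplied function (hypothetical here) that performs
    the real web request. Results are stored in a JSON file so they survive
    between sessions and the service is queried once per key.
    """
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    if key not in cache:
        cache[key] = fetch(key)                       # the "dynamic" step
        path.write_text(json.dumps(cache, indent=2))  # the "static" copy
    return cache[key]
```

Deleting the cache file, or a single entry in it, is then the equivalent of telling the sheet to refresh on demand.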

Bob Hanson | Mon, 02/27/2017 - 10:56

I would say scraping is hacking. To a certain extent it depends upon the site. Perhaps the authors intend for people to parse their page for information. But more likely they don't intend that, and they certainly do not feel obliged to keep their presentation format stable and never upgrade. Here are some reasons to avoid scraping wherever there is an API alternative:

-- might work today; no guarantee it will work tomorrow.
-- HTML pages are not designed to be used this way. They are designed with presentation in mind, not data retrieval.
-- APIs are designed for data retrieval.
-- APIs have publicly defined structure, which provides some aspect of stability.
-- HTML pages may appear differently on different platforms; APIs do not even have the sense of "platform"
-- When a web service provides an API, they intend for people to use it instead of scraping. You can contact a web service and ask them about their API, suggest improvements, point out issues. If you try that with their web page, and they have an API, they will just smile and say, "Please use our API." or maybe just dismiss you as a hack.
-- Consider the source of that blog. Do they have an agenda, perhaps?

So clearly I disagree with just about every point this blogger is making.

OLCC s12 | Sun, 02/26/2017 - 20:54

While doing background reading on some topics mentioned in this module, I recently read an article titled "Web Scraping vs API". It listed several reasons why the author thinks web scraping is often a better way to get data.

When extracting chemical data from a site that supports both methods, or at least one where the HTML is nicely organized and web APIs are offered for most of the data, which method would you recommend for setting up scripts to pull data for long-term use?

The URL for the article I read is below:

https://www.grepsr.com/web-scraping-vs-api/

Stuart Chalk | Sun, 02/26/2017 - 05:05

I have replaced the 'bad link' with one that is good for now. I also checked the WayBack Machine (http://archive.org/web/) for a copy of the page, but it's not there because the page was identified as protected by the institution.

Sorry, I did not pick this up earlier.

Bob Hanson | Sat, 02/25/2017 - 17:35

Well, there's a lesson for you! :)

That professional has moved on, and their site has been cleared. Alas! This is the future for all web pages.

It was just an example, though. Best to ignore it.

olcc s16 | Sat, 02/25/2017 - 15:40

The link for Thermodynamic Properties of Pure Substances in the first assignment does not seem to be working. I keep getting a "Page not Found" error. Thanks

Phuc

Jordi Cuadros | Sat, 02/25/2017 - 07:31

It can be emulated from a VBA function.
Take a look here: https://drive.google.com/file/d/0B5ln9gFZnHD4a0RtWWJCdWpKclE/view

Does it help?
Cheers,
Jordi