12. Optical Structure Recognition

This is a project that I have been working on, and I thought I could incorporate students into it. Students in this course have varying skills and abilities, so depending on your skill level, you might take on different aspects of this project if interested.

The project is based on the OSRA program developed by Igor Filippov at NCI.
https://cactus.nci.nih.gov/osra/

Igor’s original goal was to create software that could read structures from published PDFs and extract the image data as SMILES strings.

You can test it out here:
https://cactus.nci.nih.gov/cgi-bin/osra/index.cgi

You just upload a structure that you have drawn in a chemical drawing program and saved as a GIF or JPG.

I have contacted Igor, and in addition to some of my ideas, he has some great ideas that would be perfect for student projects.

  1. Using OSRA for handwritten structures

I saw this as a way to remove the tedium of hand grading papers. I thought it would be great to have my 200 students answer a quiz question by drawing a structure on paper and submitting it. The software could then scan the papers and check whether the structures are correct. There seems to be evidence that handwriting improves learning. Students also draw structures when they do homework, but in large classes we often test with multiple choice. Eventually, I would like to see whether students have better learning outcomes if, for example, we stick to all drawing.

I have been testing this with a subset of structures and am finding the issues with handwriting recognition in the program.

    a. Subproject 1: validation database

One of the key aspects for this part of the project would be to develop a validation set of molecules for testing.

The following is a link to a Google Sheet that has what I have in mind for this aspect of the project:

https://docs.google.com/spreadsheets/d/1Qrxa3vkWkyKGHGuAppm4lGe3UdTvz7g-OODtc8xu7Yc/edit?usp=sharing

The project would be to set up a list of alkenes, branched alkenes, cycloalkenes, alkynes, branched alkynes, alcohols, branched alcohols, alkyl halides (bromides, chlorides, fluorides), ketones, carboxylic acid derivatives (acid chlorides, esters, anhydrides, thioesters, amides), amines and thiols. You can see from the Google Sheet above that stereochemistry can and should be included. One aspect would be to use InChIKey data to determine whether a correct structure was drawn. For example, 4-methyloctane can be racemic, R or S, but the first 14 letters of the InChIKey, which encode connectivity, don’t change. So, in the Python grading program that I have written, one can award partial credit for drawing correct connectivity but incorrect stereochemistry. I know that OSRA can handle wedges and dashes, but I am unsure how well, especially when they are handwritten.
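The partial-credit idea can be sketched in a few lines of Python. This is only an illustration, not the grading program mentioned above: the function name `grade_inchikey`, the 0.5 partial-credit value, and the example keys are all assumptions. The keys are placeholders in the standard 14-10-1 InChIKey layout, not the real InChIKeys of 4-methyloctane.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, joined by hyphens.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def grade_inchikey(expected, submitted, partial_credit=0.5):
    """Score a submitted InChIKey against the answer key.

    Returns 1.0 for an exact match, `partial_credit` (an assumed value)
    when only the first 14 characters (the connectivity block) match,
    and 0.0 otherwise.
    """
    expected = expected.strip().upper()
    submitted = submitted.strip().upper()
    if not INCHIKEY_PATTERN.match(expected):
        raise ValueError("answer key is not a well-formed InChIKey")
    if not INCHIKEY_PATTERN.match(submitted):
        return 0.0  # unreadable or malformed recognition result
    if expected == submitted:
        return 1.0
    if expected[:14] == submitted[:14]:
        return partial_credit  # right connectivity, wrong stereochemistry
    return 0.0


# Placeholder keys (hypothetical, for illustration only).
KEY_R = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
KEY_S = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"
KEY_OTHER = "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N"

print(grade_inchikey(KEY_R, KEY_R))      # exact match
print(grade_inchikey(KEY_R, KEY_S))      # connectivity only
print(grade_inchikey(KEY_R, KEY_OTHER))  # different molecule
```

Comparing only the first block keeps the check independent of how the stereochemistry was drawn, which is the point of using InChIKeys here rather than raw SMILES strings.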

    b. Subproject 2: handwriting recognition

I have written a program in Python that, after a student paper is scanned in, checks each structure and compares its SMILES string to a known SMILES string. I think using InChI might be better for this, as indicated in part 1a. This part of the project would be to draw the molecules in the validation set by hand, scan them in, and determine the error rate for SMILES determination. Factors that I have found complicate this step include eraser marks, the quality of the lines drawn, and sloppy drawing. In this part of the project we would look at what makes a good structure for analysis (whether hand drawn or computer drawn).
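As a rough sketch of how the error rate could be tallied per compound class, assuming the recognition results have already been collected: the function name `error_rates`, the tuple layout, and the sample data below are hypothetical. Plain string comparison is also an assumption that only holds if the expected and recognized strings were canonicalized the same way.

```python
from collections import defaultdict


def error_rates(results):
    """Compute the fraction of misrecognized structures per category.

    `results` is an iterable of (category, expected, recognized) tuples,
    where `recognized` is None if no structure was returned at all.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for category, expected, recognized in results:
        totals[category] += 1
        # Assumption: both strings come from the same canonicalizer,
        # so a simple equality test is meaningful.
        if recognized is None or expected.strip() != recognized.strip():
            errors[category] += 1
    return {category: errors[category] / totals[category] for category in totals}


# Made-up results for illustration only.
sample = [
    ("alcohol", "CCO", "CCO"),
    ("alcohol", "CC(C)O", "CC(C)C"),  # oxygen misread as carbon
    ("alkyne", "CC#C", None),         # nothing recognized
    ("alkyne", "CC#CC", "CC#CC"),
]

for category, rate in sorted(error_rates(sample).items()):
    print(category, rate)
```

Breaking the rate down by category would show whether particular functional groups, rather than drawing quality in general, account for most of the failures.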

  2. Using OSRA for evaluation of bridged compounds

OSRA has an algorithm for dealing with bicyclic or bridged compounds. The question is how good it is at handling them. I know that morphine is in the original validation data set, but when I have tried determining its SMILES from an image, OSRA seems to get it wrong. This part of the project would be to use PubChem or Reaxys to identify bicyclic/bridged molecules (such as adamantane) and determine the success rate.

  3. OSRA for reaction recognition

Just like compounds, reactions can be represented as SMILES (reaction SMILES, rsmi). You can see an example and explanation of reaction SMILES here:

http://www.daylight.com/meetings/summerschool01/course/basics/smirks.html
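For orientation, a reaction SMILES has the form reactants>agents>products, with the individual molecules in each part separated by dots. The following Python sketch splits such a string into its three parts. The function name `split_reaction_smiles` is an assumption, and the sketch does no chemical validation of the individual SMILES.

```python
def split_reaction_smiles(rsmi):
    """Split a reaction SMILES into (reactants, agents, products) lists."""
    parts = rsmi.strip().split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators in a reaction SMILES")
    # Within each part, '.' separates the individual molecules.
    return tuple([mol for mol in part.split(".") if mol] for part in parts)


# Acid-catalyzed esterification: acetic acid + ethanol -> ethyl acetate + water
reactants, agents, products = split_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
print(reactants)  # ['CC(=O)O', 'OCC']
print(agents)     # ['[H+]']
print(products)   # ['CC(=O)OCC', 'O']
```

The agents part may be empty, as in `CCO>>CC=O`, in which case the middle list is simply empty.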

In the last few versions of OSRA (1.4.0 and higher) this capability has been added, but when I spoke with Igor last week, he indicated that this aspect of the program has not been fully vetted.

Igor has suggested we put together a collection of a few hundred images from literature and/or patents. We would have to make a database of the images with starting material, reagents, and products. Then we would need to determine what the accepted rsmi should be for the reaction, and then compare the OSRA output to the known.
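One possible shape for a record in such a database, with a simple comparison against a recognized reaction SMILES, is sketched below in Python. Everything here is hypothetical: the `ReactionRecord` class, its field names, the file name, and the `matches` helper. The comparison ignores the order of molecules within each role but otherwise relies on plain string equality, so it assumes the reference and recognized SMILES were canonicalized the same way.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReactionRecord:
    """One validation entry: a source image and its accepted reaction."""
    image_file: str
    reactants: List[str]
    reagents: List[str]
    products: List[str]

    def rsmi(self):
        """Build the accepted reaction SMILES: reactants>reagents>products."""
        groups = (self.reactants, self.reagents, self.products)
        return ">".join(".".join(group) for group in groups)


def _roles(rsmi):
    """Split a reaction SMILES into three sorted lists of molecules."""
    parts = rsmi.strip().split(">")
    if len(parts) != 3:
        raise ValueError("not a reaction SMILES")
    return [sorted(mol for mol in part.split(".") if mol) for part in parts]


def matches(record, recognized_rsmi):
    """True if the recognized output has the same molecules in each role."""
    try:
        recognized = _roles(recognized_rsmi)
    except ValueError:
        return False  # output could not be parsed as a reaction
    return _roles(record.rsmi()) == recognized


record = ReactionRecord(
    image_file="esterification_scheme.png",  # hypothetical file name
    reactants=["CC(=O)O", "OCC"],
    reagents=["[H+]"],
    products=["CC(=O)OCC", "O"],
)
print(record.rsmi())
print(matches(record, "OCC.CC(=O)O>[H+]>O.CC(=O)OCC"))  # same reaction, reordered
print(matches(record, "CC(=O)O.OCC>>CC(=O)OCC"))        # reagent and water missing
```

Keeping the three roles as separate fields makes it possible to report which part of a reaction scheme was misread, not just whether the whole string matched.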

  4. OSRA for polymer recognition

Polymer recognition was also added in the latest version of OSRA, 2.1.0. Igor indicated that this aspect of the program has not been fully vetted either. As in project 3 above, we would have to put together a database of polymers and test whether OSRA produces the correct output. He suggested that we look at the latest version of InChI, 1.05. I am guessing we could find out what they used as a validation set and use that as a starting point for OSRA.

  5. Comparison of OSRA to Imago

OSRA is not the only open source optical structure recognition program available. Imago (https://github.com/ggasoftware/imago) is also free. There is also a commercial program called CLiDE (http://www.keymodule.co.uk/products/clide/index.html). It might be interesting to compare how OSRA does against these other packages.

Long story short, there are lots of projects that can be based around optical structure recognition. I also believe that some of these projects could result in publication, whether students are looking to make posters for ACS meetings, or even a paper in J. Chem. Inf. Model. for validation of OSRA or J. Chem. Educ. for the handwriting analysis.


Comments 3

OLCC S51 | Thu, 03/09/2017 - 11:15

Hello,

I am greatly interested in working on this project this semester! I find it exciting to be able to connect a program such as this to a database of molecules. I believe this project has a lot of potential to help students and instructors of chemistry in general. I am quite curious to see if we could possibly optimize this program to help read structures more efficiently. I am looking forward to helping out!

Jeremy

OLCC S53 | Thu, 03/09/2017 - 18:57

Hello,

I would like to participate in this project on using a program to read written input and accurately convert to data that computers can use and output results.  I have a background in computer science and this would interest me to figure out how to design a program that would be able to read the images and convert perfectly.

Chandler.

OLCC S52 | Thu, 03/09/2017 - 21:53

Hello everyone!  

Along with Chandler and Jeremy, I would love to be a part of this project. I do not have much knowledge of programming but have always had a passion for furthering it because I truly believe it is our future. The fact that this project includes concepts we would be able to use every day, such as SMILES and incorporating chemical data with it, motivates me even more to get on board and see where, as a group, we can take this project.

Matea
