I have been working with a Google spreadsheet this morning, trying to access OPSIN and Cactus. There are a couple of things that I thought I would share.
1) It looks like the OPSIN implementation of SMILES is non-canonical and appears to give the same result as Open Babel. Therefore, I created a new column that takes the OPSIN SMILES and uses it as the input to Cactus through the Chemical Identifier Resolver web tool. That appears to give the Daylight, or canonical, SMILES.
2) With 4 columns doing web access of data (one each for OPSIN name to SMILES, OPSIN name to InChIKey, open structure image, and the canonical SMILES from the OPSIN SMILES), I am getting lots of "Loading..." and "#N/A". Google Sheets then gives me an error: "Error: Loading data may take a while because of the large number of requests. Try to reduce the amount of IMPORTHTML, IMPORTDATA, IMPORTFEED or IMPORTXML functions across spreadsheets you've created."
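For anyone who would rather script the two-step lookup from point 1 than run it through spreadsheet import functions, here is a minimal Python sketch. The URL patterns are my understanding of the public OPSIN and Cactus web services and should be checked against their current documentation. The function names (`opsin_url`, `cactus_url`, `name_to_cactus_smiles`) are made up for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URLs for the two public web services.
OPSIN_BASE = "https://opsin.ch.cam.ac.uk/opsin"
CACTUS_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def opsin_url(name, fmt="smi"):
    """Build an OPSIN name-to-structure URL (fmt assumed: 'smi', 'stdinchikey', ...)."""
    return f"{OPSIN_BASE}/{quote(name, safe='')}.{fmt}"


def cactus_url(identifier, representation="smiles"):
    """Build a Cactus resolver URL; SMILES characters such as '#' are percent-encoded."""
    return f"{CACTUS_BASE}/{quote(identifier, safe='')}/{representation}"


def fetch_text(url, timeout=30):
    """Download a URL and return its body as stripped text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def name_to_cactus_smiles(name, fetch=fetch_text):
    """Name -> OPSIN SMILES -> Cactus SMILES, mirroring the two spreadsheet columns."""
    opsin_smiles = fetch(opsin_url(name))
    return fetch(cactus_url(opsin_smiles))
```

Calling `name_to_cactus_smiles("4-methyloctane")` makes two web requests, so the same rate-limit courtesy applies as in the spreadsheet. The `fetch` argument is there so the logic can be tried without network access.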
Is anyone else getting really slow response times, or the same error, when importing multiple data columns? One of the issues that I think we will have in the future is that the spreadsheet rebuilds this data dynamically every time. Is there a way to have the Google Sheet store the data once collected so that it doesn't have to repopulate each time? As it stands now, if I close the sheet and come back in, it has to reload all the data. It is suggested in the above information that maybe we should copy and paste the data, once collected, into a new spreadsheet: a sort of two-database system where one is dynamic and one is static. I like how easy it is to do this with Google Sheets, but I wonder if there is a way to do this with Excel so that the spreadsheet refreshes only when I tell it to. Any ideas?
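One possible answer to the "store it once" question, if the lookups are moved into a script, is a small on-disk cache. This is only a sketch of the idea, not something Google Sheets or Excel does for you. The `cached_lookup` helper and its JSON file are hypothetical, and `fetch` stands for whatever function actually calls the web service.

```python
import json
from pathlib import Path


def cached_lookup(key, fetch, cache_path, refresh=False):
    """Return fetch(key), reusing the value saved in cache_path unless refresh is True."""
    path = Path(cache_path)
    # Load previously collected results, if any (the "static" copy of the data).
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    if refresh or key not in cache:
        # Only hit the web service for new names or when explicitly asked to refresh.
        cache[key] = fetch(key)
        path.write_text(json.dumps(cache, indent=2, sort_keys=True), encoding="utf-8")
    return cache[key]
```

With this, reopening the project does not trigger any web requests. Data is re-fetched only when `refresh=True` is passed, which is the "refresh only when I tell it to" behavior.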
Project Idea: Optical Structure Recognition
This is a project that I have been working on, and I thought I could incorporate students into it. Students in this course have varying levels of skills and abilities, so depending on skill level, interested students might take on different aspects of this project.
The project is based on the OSRA program developed by Igor Filippov at NCI:
https://cactus.nci.nih.gov/osra/
Igor’s original goal was to create software that could read structures from published PDFs and extract the image data as SMILES strings.
You can test it out here:
https://cactus.nci.nih.gov/cgi-bin/osra/index.cgi
You just upload a structure that you have drawn in a chemical drawing program and saved as a GIF or JPG.
I have contacted Igor, and in addition to some of my ideas, he has some great ideas that would be perfect for student projects.
I saw this as a way to remove the tedium of hand grading papers. I thought it would be great to have my 200 students answer a quiz question by drawing structures and submitting their papers. I could then have the software scan the papers and check whether the structures are correct. There seems to be evidence that handwriting improves learning. Students also draw structures when they do homework, but for large classes we often test with multiple choice. Eventually, I would like to see whether students have better learning outcomes if, for example, we stick to all drawing.
I have been testing this out with a subset of structures, and am finding out the issues with handwriting recognition within the program.
One of the key aspects for this part of the project would be to develop a validation set of molecules for testing.
The following is a link to a Google Sheet that has what I have in mind for this aspect of the project:
https://docs.google.com/spreadsheets/d/1Qrxa3vkWkyKGHGuAppm4lGe3UdTvz7g-OODtc8xu7Yc/edit?usp=sharing
The project would be to set up a list of alkenes, branched alkenes, cycloalkenes, alkynes, branched alkynes, alcohols, branched alcohols, alkyl halides (bromides, chlorides, fluorides), ketones, carboxylic acid derivatives (acid chlorides, esters, anhydrides, thioesters, amides), amines and thiols. You can see from the Google Sheet above that stereochemistry can and should be included. One of the aspects would be to use InChIKey data for determining whether a correct structure was drawn. For example, if you have 4-methyloctane, it can be racemic, R or S. The first 14 letters of the InChIKey don’t change. So, in the Python grading program that I have written, one can award partial credit for drawing correct connectivity but incorrect stereochemistry. I know that OSRA can handle wedges and dashes, but how well, and whether it can when they are handwritten, I am unsure.
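The partial-credit idea can be sketched in a few lines of Python. This is an illustration of the comparison only, not the grading program itself. The function name `grade_structure` and the point values are assumptions. A mismatch after the first 14 characters is treated here as a stereochemistry error, although that part of the key also encodes other layers such as isotopes.

```python
def grade_structure(student_key, answer_key, full=1.0, partial=0.5):
    """Score a drawn structure by comparing InChIKeys.

    Full credit for an identical key; partial credit when only the first
    14 characters (the connectivity block) match; otherwise zero.
    """
    student = student_key.strip().upper()
    answer = answer_key.strip().upper()
    if student == answer:
        return full
    if student[:14] == answer[:14]:
        # Same skeleton, but something in the remaining blocks differs.
        return partial
    return 0.0
```

For example, with made-up placeholder keys in the InChIKey layout (not real compounds), `grade_structure("AAAAAAAAAAAAAA-CCCCCCCCSA-N", "AAAAAAAAAAAAAA-BBBBBBBBSA-N")` returns the partial score because only the first block matches.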
I have written a program in Python that, after a student paper is scanned in, checks each structure and compares its SMILES string to a known SMILES string. I think using InChI might be better for this, as indicated in part 1a. This part of the project would be to draw the molecules in the validation set by hand, scan them in, and determine the error rate for SMILES determination. Factors that I have found complicate this step include eraser marks, the quality of the lines drawn, and sloppy drawing. In this part of the project we would look at what makes a good structure for analysis (whether hand drawn or computer drawn).
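As a rough sketch of how the error rate might be tallied per compound class, assuming each validation entry has already been reduced to an expected string and a recognized string that are directly comparable (for example, both canonicalized by the same tool, or both InChIKeys). The `error_rates` function and the record layout are hypothetical.

```python
from collections import defaultdict


def error_rates(records):
    """Return {category: fraction of structures recognized incorrectly}.

    records is an iterable of (category, expected, recognized) tuples, where
    expected and recognized are identifier strings compared by plain equality.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for category, expected, recognized in records:
        totals[category] += 1
        if expected != recognized:
            misses[category] += 1
    return {category: misses[category] / totals[category] for category in totals}
```

Adding a column to the records for the drawing condition (clean, erased, sloppy) and grouping by that instead of compound class would give the comparison of what makes a good structure for analysis.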
OSRA has an algorithm for dealing with bicyclic or bridged compounds. The question is how good it is at handling them. I know that morphine is in the original validation data set, but when I have tried determining its SMILES from an image, it seems to get it wrong. This part of the project would be to use PubChem or Reaxys to identify bicyclic/bridged molecules (such as adamantane) and determine the success rate.
Just like compounds, reactions can be represented by SMILES (reaction SMILES, or rsmi). You can see an example and explanation of reaction SMILES here:
http://www.daylight.com/meetings/summerschool01/course/basics/smirks.html
In the last few versions of OSRA (1.4.0 and higher) this capability has been added, but when I spoke with Igor last week, he indicated that this aspect of the program has not been fully vetted.
Igor has suggested we put together a collection of a few hundred images from the literature and/or patents. We would have to make a database of the images with starting materials, reagents, and products. Then we would need to determine what the accepted rsmi should be for each reaction, and compare the OSRA output to the known rsmi.
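The comparison step could look something like the Python sketch below. It relies on the Daylight layout `reactants>agents>products` with `.` between molecules, and it assumes the individual SMILES have already been put in a comparable (canonical) form, since it compares them as plain strings. The function names are made up for illustration. Note that a salt written with a dot is split into separate pieces.

```python
from collections import Counter


def split_reaction_smiles(rsmi):
    """Split 'reactants>agents>products' into three multisets of SMILES strings."""
    parts = rsmi.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got: {rsmi!r}")
    # An empty field (e.g. no agents) becomes an empty multiset.
    return tuple(Counter(part.split(".")) if part else Counter() for part in parts)


def compare_reactions(expected, recognized):
    """Report, role by role, whether the recognized reaction matches the expected one."""
    roles = ("reactants", "agents", "products")
    expected_parts = split_reaction_smiles(expected)
    recognized_parts = split_reaction_smiles(recognized)
    return {
        role: exp == rec
        for role, exp, rec in zip(roles, expected_parts, recognized_parts)
    }
```

Reporting the three roles separately shows whether OSRA tends to miss reagents over the arrow more often than the starting materials or products, which a single pass/fail comparison would hide.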
Support for polymers was also added in the latest version, 2.1.0. Igor indicated that this aspect of the program has not been fully vetted either. Just like in project 3 above, we would have to put together a database of polymers and test whether OSRA produces the correct output. He suggested that we look at the latest version of InChI, 1.05. I am guessing we could find out what they used as a validation set and use that as a starting point for OSRA.
OSRA is not the only open source optical structure recognition program available. Imago is also free: https://github.com/ggasoftware/imago There is also a commercial package called CLiDE: http://www.keymodule.co.uk/products/clide/index.html It might be interesting to see how OSRA does compared to these other packages.
Long story short, there are lots of projects that can be based around optical structure recognition. I also believe that some of these projects could result in publication, whether students are looking to make posters for ACS meetings or to aim for a paper in J. Chem. Inf. Model. for the validation of OSRA, or J. Chem. Educ. for the handwriting analysis.