OLCC Project 6: Solubility database

Final ACS Abstract: This project was designed to create an open solubility resource for the larger organic chemistry community. This project seeks to allow for solubility information to become more easily accessible, and provide information such as which compounds are soluble in which solvents, and to what degree. The information required for this project will be procured by combining data on solubility and solvents from publicly available databases and recent literature. In addition to providing a large open dataset of solubility values, the project seeks to create a webservice that can report measured and predicted solubilities directly from structure.


Nov 12:   Belford asked Davis to contact Lang with the following:
Dr. Lang,

I'm trying to get a project outline finalized for the Cheminformatics class.  I've looked at this data: http://figshare.com/articles/Open_Notebook_Science_Challenge_Solubility_Dataset/1514952 and found it very interesting.
I see columns N-AF are mostly empty; is there something there I could add to the data?  Is there something I can add to this research?
I have about 2 weeks before I have to present my project.  I understand that this is a very short time period, but what in these areas would you suggest as a feasible project for me?  Another student at Centre College is grouped with me and is going to continue the project in the spring to present at ACS.
I will greatly appreciate any input you provide.
Nov. 13:  Lang replied:
Hi Brandon,
A quick project would be to analyse the names and the structures to see if they match? Then you could give examples and statistics on the accuracy of the database. Why are the structures wrong? Missing stereochemistry? etc.
Then you would correct the structures (all the names are correct - well, at least they are the names according to the sources from which the solubility data was extracted), update the other identifiers (CSID in column AA, etc.) to agree with the name column, and then republish the result as an updated dataset on figshare (which will now include you as a co-author). I think this would be a good two-week project.
Furthering this would be to add the new data I (and Bill Acree - UNT) have been collecting for about a year, which needs converting to molar units and inputting. Then working on an open solubility model (using the CDK) and then publishing this new dataset with the model in a Chemistry Central journal or other appropriate OA journal.
The columns you mentioned are not necessarily needed anymore; column AA is needed (CSID is the ChemSpider ID). Adding an additional column with the PubChem ID would be useful for the community.
How does that sound? 
Nov 15:  Belford sent everyone the following:
The 5 of us went in on the abstract that Perry is going to present, and I had created a discussion page, and would like everyone to subscribe themselves to it, so we can use this, instead of emails, to discuss the project.  Here is the direct link:
There have been some discussions going over email, and I am going to take the liberty to post to the body of the above page their content.  One of these deals with curating solubility data that has been uploaded to figshare.
As we included predicting solubilities in the abstract, I am having Brandon give a presentation to my lecture class on Abraham's equation, and we can share his PowerPoint presentation on the above site.
Now, with regard to the data curation, I have an idea, which I would like to discuss on the above web page, but I need you all to subscribe.  With Jennifer and Perry's approval, this may be something that would be appropriate for Perry.  
One of the things we did with the WikiHyperGlossary was take an IUPAC glossary, use NIH and ChemSpider services to generate an InChI for each word, and figure that if they generated the same InChI, the word was a chemical.  (We then associated the InChI with the "word" and used that for various software tasks.)  That is, we needed to know which words were chemicals, and which were not.
Now, we published this work here,
Additional file 1 (http://youtu.be/ptX9tIrqcEE) shows why we did this,
but Additional file 4 describes this process, using Additional file 5 as a spreadsheet.
Now, I have not looked at figshare (I am very busy and still have to upload the last part of module 8), but I understand we need to make sure the structures and names match.  So my question is, can we do something like we did with the WikiHyperGlossary, where we convert the names to InChI and the structures to InChI, and then match the InChI (or, more accurately, flag when they do not match)?
OK, please subscribe to the project page.  Also, please realize that there are two, potentially three, projects going on.  That is, both Brandon and Perry are doing their own project for this class, which will be over within 4 weeks, and we are all doing a project that will be presented at the ACS meeting.  Hopefully we can sync Brandon's and Perry's class projects for their respective schools with the overall project.
So I thought Brandon could tackle Abraham's equation, and maybe Perry could tackle a script using web APIs to convert structures and names to InChI and compare them.  He can possibly use the spreadsheet in file 5 of the supporting documents as a template.
If so, we probably need to acknowledge Andrew Cornell, as he spent quite a bit of time on the project.
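
The name-versus-structure check described above could be sketched roughly as follows. This is only an illustration, not the WikiHyperGlossary script or the spreadsheet from the paper: it assumes each row of the dataset supplies a name and a SMILES string, it uses the NCI CACTUS resolver URL scheme to obtain a Standard InChI for both, and the function names (`cactus_stdinchi`, `flag_mismatches`) are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

# NCI CACTUS Chemical Identifier Resolver; "stdinchi" asks for a Standard InChI.
CACTUS = "https://cactus.nci.nih.gov/chemical/structure/{}/stdinchi"


def cactus_stdinchi(identifier):
    """Resolve a name or SMILES to a Standard InChI (network call).

    Returns None when the service cannot resolve the identifier or
    cannot be reached.
    """
    try:
        with urlopen(CACTUS.format(quote(identifier)), timeout=30) as resp:
            return resp.read().decode("utf-8").strip()
    except OSError:  # URLError, HTTPError and timeouts all derive from OSError
        return None


def flag_mismatches(rows, resolve=cactus_stdinchi):
    """Compare the InChI derived from each name with the one from its structure.

    rows    -- iterable of (name, smiles) pairs (assumed column layout)
    resolve -- callable mapping an identifier to an InChI or None
    Returns a list of (name, smiles, status) tuples, where status is
    "match", "mismatch" or "unresolved".
    """
    report = []
    for name, smiles in rows:
        from_name = resolve(name)
        from_structure = resolve(smiles)
        if from_name is None or from_structure is None:
            status = "unresolved"
        elif from_name == from_structure:
            status = "match"
        else:
            status = "mismatch"
        report.append((name, smiles, status))
    return report
```

Passing the resolver in as an argument makes it possible to swap in a different service (ChemSpider or PubChem, for example) or a local lookup table without touching the comparison logic. Rows flagged "mismatch" or "unresolved" would still need a human to decide whether the name or the structure is at fault.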

Comments 13

Jennifer Muzyka | Thu, 10/29/2015 - 08:11
I just learned about an openly available solubility database that should be helpful for this project. Here's the URL: http://srdata.nist.gov/solubility/

Brandon Davis (not verified) | Thu, 10/29/2015 - 16:53
This project is designed to assist forensic chemists, who are required to analyze very dilute samples. Sometimes those tedious extractions can be avoided by exploiting the differing solubilities of compounds being extracted in different solvents. This solubility information is not very easy to locate and is not always trustworthy. This project will allow for solubility information to be more easily accessible and provide information such as which compounds are soluble in which solvents and to what degree. The information required for this project will be procured by combining data on solubility and solvents from publicly available databases. The objective is to create a spreadsheet that connects solutes to online databases and assists the user in finding appropriate solvents. If at all possible, we would like to enable green chemistry metrics within this function.

Perry Sharma (not verified) | Wed, 12/02/2015 - 23:14
Hello everyone, I just wanted to clarify my current assignment for this project. If I am correct, I am supposed to create a script that uses web APIs to convert structures and names to InChI and compare them. Where am I supposed to get the structures and names? Are you referring to the figshare solubility dataset? If so, what is the best way to access the script editor for that Excel file? Thank you, Perry Sharma

Robert Belford | Thu, 12/03/2015 - 08:07
Hi Perry, I do not think that is your project, as it has already been done. If you look above (in the body of this discussion), I link to a YouTube video that is one of the supporting documents for a paper in JChemInf, where we did that. Now there may be something similar, which is to scrape solubility data from various resources and compare it. But I think we all need some discussion on this. I suggest you look at the YouTube video; there is also an MS Word document that describes the process, and an Excel spreadsheet that has the functions in it. You should be able to download the latter two from the JChemInf site (it is open access). But thanks for contacting us, and I am sure we will come up with something worth presenting at the ACS meeting. Cheers, Bob

Perry Sharma (not verified) | Thu, 12/03/2015 - 23:24
Hello, Dr. Belford. I took a look at the MS Word document, but for some reason the Excel file is corrupt. I tried to look for the file on the JChemInf website but I could not find it. Is there any way you could post a working link? Thank you

Jennifer Muzyka | Fri, 12/04/2015 - 07:13
I had the same trouble as Perry with the Excel file. I thought it might be that I'm a Mac user. But Perry's not a Mac user. Here's the link to the JChemInf article, copied from above: http://www.jcheminf.com/content/7/1/22

Robert Belford | Fri, 12/04/2015 - 07:43
I too am now having this problem, which is interesting, as it worked when the article was reviewed. I will look into this. But also, what we did was several years old, and there are different ways to do it now. That is, we wrote a script to interact with the API, and now there are functions in Excel, LibreOffice and Google Sheets that do what the script did. You should still be able to watch the video, and there may be a message here, which is: make videos, so that when something is degraded or no longer supported, you can still show what you did. What bothers me is that the sheet is now corrupt, and that makes no sense (I can see that it would stop working if an API changed, but why would it become corrupt?)

Jennifer Muzyka | Fri, 12/04/2015 - 07:18
Yesterday I ran across a Kindle book with solubility data. I paid 99 cents for it. But it turns out you can get the book for free from the Internet Archive. And it turns out there is a second volume. You can get to both volumes on the Internet Archive (http://archive.org) with the title Solubilities of inorganic and organic compounds; a compilation of quantitative solubility data from the periodical literature. But maybe we already have more data than we could hope to use in this project.

Perry Sharma (not verified) | Thu, 02/11/2016 - 17:57
02/09/16 Hello, everyone. It has been a little while since any discussions on this project have taken place, and now that I am back I think we can resume discussions. First, I would like to mention that I wrote a paper about this project last semester in my cheminformatics class, and I will upload the document soon (it should be at the very bottom). The paper explains the project and gives some information about what I've accomplished so far on this project (which isn't much).
I had a Skype session today with Dr. Belford as well as Dr. Muzyka, and we had some interesting discussions about what direction this project is taking. Dr. Belford suggested that a critical part of this project is that we add tools to the solubility spreadsheet that allow users to enter solubility values. The problem is that different solubility databases will have solubility values in different units, so it is important that we find a web API for a unit converter that will allow us to convert the values so that they all have the same units, making comparisons easier. I hope I am on the right track explaining the problem. If not, Dr. Belford, please feel free to add.
Additionally, a third dimension is added to the problem once we realize that the different solubility databases will have different solubility values depending on which solvent was used to determine the value. For example, you will notice that some databases report solubility in water, while other databases base their solubility values on other solvents. Although this is a problem, I think it is secondary to the unit problem that we will encounter.
I have found a couple of databases on the internet already (links are below), and you will notice that all three of the databases have different units for solubility. The goal is to process these values so that they all have the same units. Please feel free to provide ideas on how to do this, and Dr. Belford, if I am missing any part of the problem, please add it so we can all be on the same page. Also, if you happen to stumble upon some solubility databases, please add them on this page.
Links:
http://www.chem.wisc.edu/deptfiles/genchem/sstutorial/Text11/Tx112/tx112.html
https://www.organicdivision.org/orig/organic_solvents.html
http://cool.conservation-us.org/coolaic/sg/bpg/annual/v03/bp03-04.html
Thank you, Parijat Sharma
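
To make the unit problem concrete, here is a minimal sketch of normalizing mass-based solubility values to molarity. It is not one of the project's scripts or an existing web service: the unit labels, the `to_molarity` name, and the assumption that every mass concentration is expressed per volume of solution are all illustrative choices, and values quoted per mass of solvent (for example g per 100 g of water) would additionally need a density.

```python
# Factors that turn a mass concentration into grams per litre.
# The unit labels are hypothetical; real sources would need mapping onto them.
TO_GRAMS_PER_LITRE = {
    "g/L": 1.0,
    "mg/mL": 1.0,
    "mg/L": 1e-3,
    "g/100mL": 10.0,
}


def to_molarity(value, unit, molar_mass=None):
    """Convert a solubility value to mol/L.

    value      -- the reported number
    unit       -- "mol/L", "M", or a key of TO_GRAMS_PER_LITRE
    molar_mass -- solute molar mass in g/mol (needed for mass-based units)
    """
    if unit in ("mol/L", "M"):
        return float(value)
    if unit not in TO_GRAMS_PER_LITRE:
        raise ValueError(f"unsupported unit: {unit!r}")
    if molar_mass is None or molar_mass <= 0:
        raise ValueError("a positive molar mass is required for mass-based units")
    grams_per_litre = value * TO_GRAMS_PER_LITRE[unit]
    return grams_per_litre / molar_mass


# Example with made-up numbers: 5.0 g/100mL of a solute with M = 100 g/mol.
print(to_molarity(5.0, "g/100mL", molar_mass=100.0))
```

The solvent is deliberately not part of the conversion. Keeping it as its own column next to the converted value would let values be compared only within the same solvent, which addresses the secondary problem raised above.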

Andrew Lang | Mon, 02/15/2016 - 09:18
Here's this solubility database: https://figshare.com/articles/Open_Notebook_Science_Challenge_Solubility_Dataset/1514952 See here for scripts and webservices to convert between solubility units (x, M, etc.): http://onswebservices.wikispaces.com/solubility

Perry Sharma (not verified) | Tue, 02/23/2016 - 11:04
Hello, Dr. Lang. Thank you so much for providing the scripts for the solubility calculations. So far I have only been able to access one of the scripts, and it's the one that converts mole fraction to molarity. What I am doing is translating the PHP script and then inserting the code into the Google Sheets script to scrape for variables such as solute density, solute molecular weight, etc. I've had success obtaining the molecular weight from Cactus using Google Apps Script, but I am unable to obtain the densities from Cactus. Is that because Cactus does not provide density data? Is there a ChemSpider URL API scheme that I can use to obtain density from ChemSpider instead? Also, I was wondering if you would like me to replicate the PHP scripts in JavaScript and then use them in Google Sheets, or would you like me to use JavaScript in Google Sheets to directly access the PHP scripts you provided? Lastly, I was wondering if the rest of the calculation scripts can be made accessible, because so far I have only been able to access one. Thank you, Perry Sharma
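
For readers wondering why the mole fraction to molarity conversion needs densities and molecular weights at all, a rough sketch follows. It is not a translation of Dr. Lang's PHP script, whose exact formula is not reproduced in this thread; it assumes a binary solution whose volume is the sum of the pure-component volumes (ideal mixing), and the function name is hypothetical.

```python
def mole_fraction_to_molarity(x, solute_mw, solute_density,
                              solvent_mw, solvent_density):
    """Estimate molarity (mol/L) from the solute mole fraction x.

    Molecular weights are in g/mol and densities in g/mL.
    ASSUMPTION: volumes are additive, so one mole of mixture occupies
    x*MW_solute/rho_solute + (1 - x)*MW_solvent/rho_solvent millilitres.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("mole fraction must lie between 0 and 1")
    volume_ml = (x * solute_mw / solute_density
                 + (1.0 - x) * solvent_mw / solvent_density)
    # x moles of solute in volume_ml millilitres, scaled to one litre
    return 1000.0 * x / volume_ml


# Made-up numbers: equal molecular weights (100 g/mol) and densities (1 g/mL)
print(mole_fraction_to_molarity(0.5, 100.0, 1.0, 100.0, 1.0))  # 5.0
```

Under this assumption four inputs besides the mole fraction are needed, which is why a missing solute density blocks the calculation even when the molecular weight is available.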

Andrew Lang | Tue, 02/23/2016 - 11:41
Hi Perry, I can help you. Please contact me at alang<at>oru.edu