I found a free web site with apps that convert files into and out of XML. It's at
<a href="http://xmlgrid.net/xml2text.html">http://xmlgrid.net/xml2text.html</a>
and might be useful in exploring the PubChem data in Excel.
- Ralph
For some reason, several Zip programs would not unzip the LCSS dump. I was able to do so with 7-Zip, but the file is too big to upload to our site. Fortunately, UALR has unlimited Google Drive, so I put the original file here:
<a href="https://drive.google.com/open?id=0ByRWZ4TaLO_0NndIMHFfVF9PbVk">https://drive.google.com/open?id=0ByRWZ4TaLO_0NndIMHFfVF9PbVk</a>
I had to remove the following in order to load it into Excel:
<a href="https://drive.google.com/open?id=0ByRWZ4TaLO_0NndIMHFfVF9PbVk">https://drive.google.com/open?id=0ByRWZ4TaLO_0NndIMHFfVF9PbVk</a>
And here it is in Excel:
<a href="https://drive.google.com/open?id=0ByRWZ4TaLO_0NndIMHFfVF9PbVk">https://drive.google.com/open?id=0ByRWZ4TaLO_0NndIMHFfVF9PbVk</a>
Now Brian and I have found another way to get the list of chemicals for which there is a PubChem LCSS. Brian, are you going to try to use the NIH Resolver to convert the names to InChIKeys (I think it does that)? It would be sort of the way we did it in the second module, except getting the InChIKey instead of the molar mass.
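For anyone who wants to try the name-to-InChIKey idea, here is a minimal sketch. It assumes the "NIH Resolver" means the NCI/CADD Chemical Identifier Resolver at cactus.nci.nih.gov and that its `stdinchikey` representation is what we want. The helper names (`inchikey_url`, `strip_prefix`, `lookup_inchikey`) are made up for illustration, and the handling of an `InChIKey=` prefix in the response is an assumption about the output format.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the NCI/CADD Chemical Identifier Resolver.
RESOLVER = "https://cactus.nci.nih.gov/chemical/structure"


def inchikey_url(name: str) -> str:
    """Build the resolver URL that requests the Standard InChIKey for a name."""
    # quote() percent-encodes spaces and other characters unsafe in a URL path.
    return f"{RESOLVER}/{quote(name)}/stdinchikey"


def strip_prefix(text: str) -> str:
    """Drop a leading 'InChIKey=' label if present (assumed response format)."""
    return text.strip().split("=", 1)[-1]


def lookup_inchikey(name: str, timeout: float = 10.0) -> Optional[str]:
    """Return the InChIKey for one chemical name, or None if the lookup fails."""
    try:
        with urlopen(inchikey_url(name), timeout=timeout) as response:
            return strip_prefix(response.read().decode("utf-8"))
    except OSError:
        # URLError and HTTPError are OSError subclasses, so this covers
        # both "name not found" responses and network problems.
        return None
```

Calling `lookup_inchikey("acetone")` for each name in the list and writing the result to a new column would mirror what we did in the second module. Names that come back as `None` would need a manual check.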
Google Docs doesn't allow me access to the Excel files you posted; do you need to share them with me? My Google account is <a href="mailto:keenestateehs@gmail.com">keenestateehs@gmail.com</a>
I generated an Excel file that compares Wikipedia safety information with PubChem LCSS information by hand for about 90 chemicals. The list of chemicals came from the original LCSS roster in Prudent Practices. In the process, I suspect that I made clerical errors, so replicating this effort electronically would be a good step. I hope that the Excel file that Bob generated will help us do this.
These are the columns in the spreadsheet and what they indicate:
Wikipedia Entry: Is there a Wikipedia entry for this chemical?
Chembox: Does the Wikipedia entry have a Chembox?
Safety info: Does the Chembox contain any safety information?
GHS info: Does the Chembox contain GHS information?
PubChem LCSS?: Is there a PubChem LCSS for this chemical?
PubChem sections: How many content sections are there in the PubChem LCSS?
Number of sets: How many different sets of GHS symbols does PubChem present?
Distinct sets: How many different sets of symbols are there?
The interesting result is that Wikipedia has more coverage of the chemicals listed (95% vs. 82%), but less safety information (78% in Wikipedia, 82% in PubChem) and much less GHS info (33% of the chemicals listed have GHS info in Wikipedia).
Brian, is it possible to verify these numbers?
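One way to double-check percentages like these is to recount the yes/no columns in code. The sketch below assumes each spreadsheet row has been read into a dictionary keyed by column name and that the cells hold "yes" or "no". The four sample rows are invented placeholders, not data from the actual file.

```python
def percent_yes(rows, column):
    """Percentage of rows whose value in `column` is 'yes' (case-insensitive)."""
    if not rows:
        return 0.0
    hits = sum(
        1 for row in rows
        if str(row.get(column, "")).strip().lower() == "yes"
    )
    return 100.0 * hits / len(rows)


# Invented placeholder rows; the real values would come from the spreadsheet.
sample = [
    {"Wikipedia Entry": "yes", "GHS info": "no", "PubChem LCSS?": "yes"},
    {"Wikipedia Entry": "yes", "GHS info": "yes", "PubChem LCSS?": "no"},
    {"Wikipedia Entry": "no", "GHS info": "no", "PubChem LCSS?": "yes"},
    {"Wikipedia Entry": "yes", "GHS info": "no", "PubChem LCSS?": "yes"},
]

for column in ("Wikipedia Entry", "GHS info", "PubChem LCSS?"):
    print(f"{column}: {percent_yes(sample, column):.0f}%")
```

Note that this divides by the total number of chemicals listed. If a figure in the spreadsheet was calculated against a different denominator (for example, only the chemicals that have a Chembox), the rows would need to be filtered first.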
Can you upload the file to this page, or to the Google Drive I just shared with you? It would be nice to see how you structured the file.
I just uploaded the file to the page. "Structured" is a little generous; it was more of a note-taking device with some calculations at the bottom. But I think that it gives us an idea of how to generate interesting data for SD.
- Ralph
I've uploaded an Excel file with the list of chemicals from the LCSS. There are four tabs: Sheet0, Sheet1, Sheet2, and Sheet3. Sheet3 is the filtered-down list in alphabetical order. The other sheets are just part of the filtering process. There are many duplicates, which I can get rid of, and there are many chemicals that PubChem lists as numbers rather than names.
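The duplicate removal could also be scripted rather than done by hand. This is only a sketch: it assumes the names have been exported from Sheet3 as a plain list of strings, that duplicates differ at most by case or surrounding spaces, and that "listed as numbers" means entries made up purely of digits. The function name `clean_names` is hypothetical.

```python
def clean_names(names):
    """Remove duplicates and separate digit-only entries from real names.

    Returns (named, numeric): `named` is sorted alphabetically ignoring case,
    `numeric` keeps the order in which the entries first appeared.
    """
    seen = set()
    named, numeric = [], []
    for raw in names:
        name = raw.strip()
        key = name.casefold()  # case-insensitive comparison key
        if not name or key in seen:
            continue  # skip blanks and entries already seen
        seen.add(key)
        if name.isdigit():
            numeric.append(name)
        else:
            named.append(name)
    return sorted(named, key=str.casefold), numeric
```

The numeric entries are kept in a separate list so they can be looked up later instead of being thrown away.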
I've uploaded a first draft of an overview of how I think mining PubChem and the Wikipedia Chemboxes can help us work with the chemical safety information that is available on the web. The numbers in the paper are based on my "spare time" manual review last week of the 88 or so chemicals named in the 1995 edition of Prudent Practices. It would be great if we could double-check them electronically and expand the number of chemicals reviewed, using the XML data from PubChem and scraping of the Chemboxes.
With the PubChem Identifier Exchange Service we can convert PubChem IDs to InChIKeys, as well as to other identifiers. It can be found at <a href="https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi">https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi</a>
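The Identifier Exchange Service is a web form. If we want to do the same conversion from a script, one alternative is PubChem's PUG REST interface, which is a different service from the one linked above. The sketch below assumes the IDs are PubChem CIDs and that the CSV response has `CID` and `InChIKey` columns. The function names are made up for illustration.

```python
import csv
import io

# PUG REST base URL (a different PubChem service from the web form).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def inchikey_request_url(cids):
    """Build a PUG REST URL asking for the InChIKey of each CID, as CSV."""
    cid_list = ",".join(str(int(cid)) for cid in cids)  # int() rejects non-numeric IDs
    return f"{PUG_REST}/compound/cid/{cid_list}/property/InChIKey/CSV"


def parse_inchikey_csv(text):
    """Turn the CSV response text into a {CID: InChIKey} dictionary.

    Assumes the header row names the columns 'CID' and 'InChIKey'.
    """
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["CID"]): row["InChIKey"] for row in reader}
```

Fetching the URL with `urllib.request.urlopen` and passing the decoded text to `parse_inchikey_csv` would give a lookup table that could be merged back into the spreadsheet.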
Here is some interesting stuff:
<a href="http://chem-bla-ics.blogspot.com/2016/01/adding-chemical-compound-to-wikidata.html">http://chem-bla-ics.blogspot.com/2016/01/adding-chemical-compound-to-wikidata.html</a>
This led me to Hay's tools:
<a href="http://tools.wmflabs.org/hay/">http://tools.wmflabs.org/hay/</a>
of which the tool directory may be of interest
<a href="http://tools.wmflabs.org/hay/directory/">http://tools.wmflabs.org/hay/directory/</a>
<a href="https://www.ebi.ac.uk/efo/webulous/">https://www.ebi.ac.uk/efo/webulous/</a>