Discussion

Brandon Davis (not verified) | Sat, 09/12/2015 - 12:38
I saved a file in two formats as in the article; the .docx file was 12 KB and the .doc was 22 KB. I was curious what factors contribute to the discrepancy between my files and those described in the article.

Kourtnei Rooks (not verified) | Thu, 09/10/2015 - 17:14
I am having a hard time copying the table from the webpage and transferring the information to Excel. When I add the table to the Excel sheet I get an error. Could you tell me what I could be doing wrong with this assignment?

Stuart Chalk | Thu, 09/10/2015 - 17:12
Look at my comment at <a href="http://olcc.ccce.divched.org/comment/286#comment-286">http://olcc.ccce.divched.org/comment/286#comment-286</a>. You can create URLs in Excel cells that point to the CIR site for both CAS number and molecular weight.
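
As a rough illustration of the URL idea outside Excel, here is a minimal Python sketch that builds CIR links for a list of compound names and writes them as CSV text that Excel can open. The path pattern `/chemical/structure/<identifier>/<representation>` with `cas` and `mw` as representations is an assumption about the resolver's URL scheme and should be checked against the site; `cir_url` and `url_table` are hypothetical helper names, and the code only builds URLs, it does not contact the server.

```python
import csv
import io
from urllib.parse import quote

# Assumed base path of the Chemical Identifier Resolver (CIR); verify against the site.
CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def cir_url(identifier, representation):
    """Return a CIR URL for one identifier (name, CAS number, ...) and one property."""
    # Percent-encode the identifier so spaces and slashes stay inside one path segment.
    return f"{CIR_BASE}/{quote(identifier.strip(), safe='')}/{representation}"


def url_table(names):
    """Return CSV text with one row per compound: name, CAS URL, molecular weight URL."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "cas_url", "mw_url"])
    for name in names:
        writer.writerow([name, cir_url(name, "cas"), cir_url(name, "mw")])
    return buffer.getvalue()


if __name__ == "__main__":
    # Example names only; replace with the compounds from your own sheet.
    print(url_table(["aspirin", "acetic acid"]))
```

Saving the printed text as a .csv file gives a sheet with one clickable lookup link per property; the same string concatenation can be reproduced with a cell formula in Excel itself.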

Stuart Chalk | Thu, 09/10/2015 - 17:04
I was not expecting students to bring a large number of chemicals into Excel for this exercise, so I was not expecting this to be a burden. I would talk to your advisor about how many of the chemicals you have imported and how many molecular weights they would like to see. This is not about making it an onerous task; it is more about the process and learning by doing. If you wanted to be creative with this, you could write an Excel function to generate a URL to access the molecular weight data on the Chemical Identifier Resolver site (cactus.nci.nih.gov) based on the name of the compound. I will leave you to think about how to do this... Now, your question about importing data is actually a great segue into another topic that is coming up later in the course: using Excel as a database and importing data from remote sources. You can do this in Excel for Windows but not Mac, and you have to find the right source. I am not going to say any more so as not to steal the thunder of one of my fellow lecturers...

Sarah House (not verified) | Thu, 09/10/2015 - 17:03
Is there an easy way to fill in the CAS numbers and molecular weights for the chemicals listed in Microsoft Excel cells without having to look up each chemical individually and copy and paste the values?

Stuart Chalk | Thu, 09/10/2015 - 16:48
Did you mean XML > .doc? If so, then yes, XML files tend to be large because of all the tags in them. Microsoft realized that, so because of the size and the need for multiple XML files to represent a Word document, they zipped the folder of files and changed the extension to .docx. The size difference for .docx files varies, though, because image files stored inside the folder do not compress much, as they are normally already compressed. So you will see the biggest size difference for 'text only' Word documents. If I have misinterpreted your question let me know...
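
To see the compression effect for yourself, here is a small Python sketch using the standard `zipfile` module. It builds a tiny in-memory zip archive that stands in for a .docx (the XML content is made up for the demo and is not a valid Word document) and reports the stored size against the original size of each file inside; `summarize_zip` and `build_demo_archive` are hypothetical helper names.

```python
import io
import zipfile


def summarize_zip(source):
    """Return (name, stored_size, original_size) for each member of a zip archive."""
    with zipfile.ZipFile(source) as archive:
        return [
            (info.filename, info.compress_size, info.file_size)
            for info in archive.infolist()
        ]


def build_demo_archive():
    """Build an in-memory zip holding one repetitive XML file, standing in for a .docx."""
    # Made-up, tag-heavy XML; repeated tags are what make XML compress well.
    xml = "<w:p><w:r><w:t>cheminformatics</w:t></w:r></w:p>" * 200
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, stored, original in summarize_zip(build_demo_archive()):
        print(f"{name}: {original} bytes of XML stored as {stored} bytes")
```

Passing the path of a real document instead, for example `summarize_zip("report.docx")` with a file name of your own, should list the XML parts and any embedded images, and the images will typically show little difference between stored and original size.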

Brandon Davis (not verified) | Thu, 09/10/2015 - 16:37
I thought that the XML document format was supposed to be smaller than the older Office files. I saved two files that each said "cheminformatics" and the .docx one was about half the size of the .doc. Is there a setting in Word that adds extra compression?

Leah Rae McEwen | Thu, 09/10/2015 - 13:52
Brandon, excellent point about blogs as "gray" information sources. Blogs poignantly illustrate both the opportunities and challenges of "gray" sources and why we designate them this way in the library environment. They are a great opportunity to foster more scientific exchange, extending the value of conferences and research hallways. The greatest concern is always reliability, as discussed in this module and some other posts. It behooves any user of scientific information to consider some level of evaluation for their context, and this is certainly true for "gray" sources that are not always well characterized as to intended audience, authority, accuracy, and bias. That said, there are some very interesting scientific exchanges published in blogs by well-respected professionals, research labs, and scientific organizations. You asked about chemistry-centric blogs. A list is maintained on the "Chemical blogspace" as a place to start: <a href="http://cb.openmolecules.net/">http://cb.openmolecules.net/</a> (currently maintained by Geoff Hutchison and the University of Pittsburgh). About two years ago, a cheminformatics colleague analyzed the appearance of blog links on these blogs as a proxy for identifying the top 5 (<a href="http://baoilleach.blogspot.com/2013/12/top-5-favourite-blogs-of-chemistry.html">http://baoilleach.blogspot.com/2013/12/top-5-favourite-blogs-of-chemistry.html</a>). The top 5 appearing then are below (the code for this analysis is on that blog; anyone want to run the updated numbers?):
"In the Pipeline" (<a href="http://pipeline.corante.com">http://pipeline.corante.com</a>, Derek Lowe)
"Chem-bla-ics" (<a href="http://chem-bla-ics.blogspot.com">http://chem-bla-ics.blogspot.com</a>, Egon Willighagen)
"ChemBark" (<a href="http://blog.chembark.com">http://blog.chembark.com</a>, Paul Bracher)
"Org Prep Daily" (<a href="http://orgprepdaily.wordpress.com">http://orgprepdaily.wordpress.com</a>, Milkshake)
"The Sceptical Chymist" (<a href="http://blogs.nature.com/thescepticalchymist">http://blogs.nature.com/thescepticalchymist</a>, Nature Chemistry)
Other popular blogs are:
ChemJobber (<a href="http://chemjobber.blogspot.com">http://chemjobber.blogspot.com</a>, anonymous)
SafetyZone (<a href="http://cenblog.org/the-safety-zone">http://cenblog.org/the-safety-zone</a>, Chemical & Engineering News)
Henry Rzepa (<a href="http://www.ch.ic.ac.uk/rzepa/blog">http://www.ch.ic.ac.uk/rzepa/blog</a>, Imperial College London)
Retraction Watch (<a href="http://retractionwatch.com">http://retractionwatch.com</a>, <a href="http://retractionwatch.com/meet-the-retraction-watch-staff">http://retractionwatch.com/meet-the-retraction-watch-staff</a>)

OLCC s12 | Thu, 09/10/2015 - 13:42
After reading through this section on APIs and then doing the assignment on putting data into Excel, how would you recommend importing any missing data using APIs for a very large list? For my example I pulled data from the OPCW list of chemical weapons, which included a number, chemical type, examples, and CAS numbers. The molecular weights were not included, so, going along with my question, what would be the quickest way to import these molecular weights using APIs in Excel, at least without having to copy and paste individual numbers that were manually looked up in a web browser based on the API URL structure you explained using cactus.nci.nih.gov? It seems a large list would likely take many weeks just to find the molecular weights.

Stuart Chalk | Thu, 09/10/2015 - 09:41
While you can certainly get the data out of these formats, my intent in setting up the assignment was that the data be accessible as part of the HTML in the webpage. Having said that, you can convert image files to text if you have access to a full copy of Adobe Acrobat (not the Reader) or other software that can perform optical character recognition (OCR). In Adobe Acrobat, choose 'Create PDF from file', select the image file you want, and then run Acrobat's text extraction tool (I recommend using the ClearScan option in the dialog). Once the information is text you can copy and paste it across into Excel. If you have a PDF file to begin with you can do the same thing.