2.7 Assignment Exercise

Now that you know enough about computers and metadata to be information gurus here's an exercise to put your skills to the test. You will create an Excel spreadsheet of chemical property data and metadata.

  1. Using the research skills you learned in module 1a find a web page that has a lot of chemical property data on it.  It should not be a page that is part of an existing online database, rather a page that is maintained by a science professional(s) as a service to the community. It should also have the following attributes:
  2. Save the web page as a text file on your computer.  Most browsers offer you the option to save as text.  This removes any HTML formatting and makes it easier to work with the actual data
  3. Import that text file into Excel.  Do this by opening Excel first and then opening the text from the file menu.  You should see the dialog box below.  In most cases the 'delimited' option works best for importing data.  You pick a character on the second dialog page to allow Excel to know how to put the data across different columns (tab, pipe (|) , comma, etc...).  This will help with organizing the data in the next step.
    Excel dialog for text file import
  4. Organize the data that you have imported in columns (different data or metadata) and rows (compounds).  Give a title to each column and underneath it identify the datatype for the column (text, enum, integer, float etc.).  Some columns may be more than one datatype so try and pick the best one - i.e. the one that is most appropriate for all values that might logically be in that column.  Include columns that contain the Chemical Abstracts Registry Number (CAS number) and molecular weights for all of the compounds in the table if they are not already part of the data.  An example of what this might look like is below.

    Excel assignment file

 

Rating: 
0
No votes yet
Join the conversation.

Comments 23

Brian Murphy | Thu, 09/10/2015 - 09:25
Quite a few tables online are in JPEG and PDF format. Is there a preferred method to get that data into an excel spreadsheet or is an online PDF to xls converter the only option?

Stuart Chalk's picture
Stuart Chalk | Thu, 09/10/2015 - 09:41
While you can certainly get the data out of these formats, my intent in setting up the assignment was that the data be accessible as part of the HTML in the webpage. Having said that, you can process image files to text if you have access to a full copy of Adobe Acrobat (not the reader) or other software that can perform optical character recognition (OCR). In Adobe Acrobat you 'Create PDF from file' and then select the image file you want, and run Acrobat's text extraction tool (I recommend using the ClearScan option in the dialog). Once the information is text you can copy and paste across into Excel. If you have a PDF file to begin with you can do the same thing.

Sarah House (not verified) | Thu, 09/10/2015 - 17:03
Is there an easy way to convert the cells on microsoft excel into their CAS numbers and molecular weights without having to look up each individual chemical and using copy paste?

Stuart Chalk's picture
Stuart Chalk | Thu, 09/10/2015 - 17:12
Look at the my comment @ <a href="http://olcc.ccce.divched.org/comment/286#comment-286">http://olcc.ccce.divched.org/comment/286#comment-286</a>. You can create URLs in Excel cells that point to the CIR site for both CAS # and molecular weight.

Kourtnei Rooks (not verified) | Thu, 09/10/2015 - 17:14
I am having a hard time with copying the table from the webpage and transferring the information to excel. When I add the table to the excel sheet I get an error. Could tell me what I could be doing wrong with this assignment?

Stuart Chalk's picture
Stuart Chalk | Tue, 09/15/2015 - 07:20
I can only speculate as to the reason for this. If you are trying to copy and paste a table direct from the webpage, my guess is that there is/are some special characters in the webpage that Excel does not like. If you save the webpage as a plain text file then I think you will have more luck. If this does not help send me an email and I will troubleshoot and post the answer.

Daniel Graham (not verified) | Mon, 09/14/2015 - 19:37
How important is it to complete this assignment in Excel vs. OpenOffice? Is Excel an industry/academic standard that I should definitely learn over OpenOffice, or can I use OpenOffice to accomplish the same goal?

Bob Belford's picture
Bob Belford | Mon, 09/14/2015 - 20:00
Hi Daniel, At risk of not being your instructor I will chirp in here. I have some students using LibreOffice, I use Open Office and Excel, and later this semester we will be using Google Docs and Sheets. Our objective is not to teach you how to use Excel, but how you can use Cheminformatic technologies in your scientific workflow. But you need to start somewhere, and unless told otherwise, should probably start with the program you are most comfortable with, and go from there. Personally, I am about to start playing with LibreOffice. Here is an interesting article about it and Open Office. <a href="http://www.howtogeek.com/187663/openoffice-vs.-libreoffice-whats-the-difference-and-which-should-you-use/">http://www.howtogeek.com/187663/openoffice-vs.-libreoffice-whats-the-difference-and-which-should-you-use/</a> What I think is important is realizing that web pages and web browsers are not the only way we can access and use online data. Cheers, Bob Belford

Jennifer Muzyka | Mon, 09/14/2015 - 20:13
As Daniel's instructor, I'm happy to hear the thoughts of faculty members at other institutions. I use Excel and sometimes Numbers. Recently I read that Google Docs could be frustrating due to some of its limitations. But today I read that Google Docs makes it easy to use Web Services like Dr. Belford demonstrated recently with Office 2013. Since I'm a Mac user and still using Office 2011, I may need to investigate the Google Docs approach to implement the automated spreadsheet web lookup functions. Some of the folks who present web-based apps at chemical education conferences use Excel to keep track of their data. I look forward to learning more about how I might adopt some of those approaches. Thanks for the article explaining the differences between OpenOffice and LibreOffice. Jennifer Muzyka

Otis Rothenberger's picture
Otis Rothenberger | Tue, 09/15/2015 - 08:20
Bob, Thanks for your Excel video - very nice. This is particularly useful with Resolver. To my knowledge, Resolver does not blacklist. Students, be careful if you make mass PubChem downloads, PubChem will blacklist an offending IP number. This is a bigger problem for server managers. For this reason, PubChem recommends that web apps download directly to the browser rather than via a server proxy. Still users of these browser apps need to know that it's possible to get PubChem upset with your IP number because of a heavy hit for data in a short period of time. SDBS is an example of a wonderful spectral data source that also had to enforce some restrictions because of mass data collection: <a href="http://sdbs.db.aist.go.jp/sdbs/cgi-bin/cre_index.cgi">http://sdbs.db.aist.go.jp/sdbs/cgi-bin/cre_index.cgi</a> Jennifer, I'm also a Mac user. iWorks Numbers is a nice little spreadsheet for hand entry and routine tasks, but automation is all but gone in recent versions. Still, I use is for all my routine tasks. For students using Macs, Google Sheets has four data import functions. They are all url based. Their names are fairly self explanatory - IMPORTDATA, IMPORTFEED, IMPORTHTML, IMPORTXML. I agree that it does not hurt to have MS experience on a resume, but there are other options available. For me life after Windows (I'm retired.) is carefree and happy with Mac! Regards, Otis

Stuart Chalk's picture
Stuart Chalk | Tue, 09/15/2015 - 07:27
I think as long as the activity is completed correctly in your spreadsheet of choice then, my me, its fine (but consult with your facilitator). I think though that in industrial chemistry laboratories, companies are likely to be all MS products and not open source, as the IT folk will likely require them. So, being able to put 'proficient in MS software' on your resume is a selling point. (I also think that highlighting this course on your resume will get you an even better chance at a job :) )

Joshua Henrich (not verified) | Tue, 09/15/2015 - 17:00
Dr. Chalk, Are there any other good chemical identifier resolver sites (besides cactus.nci.nih.gov)? When my partner and I write an Excel function to generate a URL, there are a good amount of chemical compounds that the nci.nih.gov site cannot find CAS or molecular weight data for.

Stuart Chalk's picture
Stuart Chalk | Wed, 09/16/2015 - 07:40
In a general sense the CIR site is the best and the only site that is specifically a chemical identifier converter (resolver). Other sites you can use to do the same process are <a href="http://pubchem.ncbi.nlm.nih.gov">http://pubchem.ncbi.nlm.nih.gov</a> and <a href="http://chemspider.com">http://chemspider.com</a>.

John House (not verified) | Tue, 09/15/2015 - 17:02
I am using the excel webservice method to convert compound names to CAS numbers. However, when attempting with charged species, such as Chromium (III) sulfate, the chemical resolver cannot return a CAS number. Does anyone have any ideas?

Otis Rothenberger's picture
Otis Rothenberger | Tue, 09/15/2015 - 17:57
John and Joshua, I’d like to make a few points about NIH/CADD Resolver that may be helpful. First, Resolver is not really a chemical data source. It’s primary function is the interconversion of chemical identifiers. I’ll reference the identifiers name, IUPAC, InChI, InChIKey, CAS, SDF (2D and 3D), and SMILES in this note. 1) Name Identifier: This is the weakest identifier of all, and it is Resolver’s weak link. A given chemical compound can have a huge number of names. When Resolver is presented with an input name, it must use look-up. IUPAC is a different matter, but I’ll get to that in a minute. Bottom Line at Resolver: Look-up Weak. 2) CAS Identifier: This is almost as bad as name! Many compounds have more than one CAS number. Here are CAS numbers for ethanol from Resolver: 121182-78-3 64-17-5 8024-45-1 8000-16-6 68475-56-9 71076-86-3 71329-38-9 There are two reasons for this multiplicity: Chemical Formulations and Money (CAS sells these things!). Bottom Line at Resolver: Look-up Weak. 3) IUPAC, InChI, SMILES, SDF Identifiers: These identifiers are all exact codes for the structural atom connectivity of a molecule. OChem Textbooks, of course, cover IUPAC, but IUPAC is only one member of this powerful group. With any of these as input, Resolver uses powerful algorithms to convert input to any of the others that may be need. No Look-up Involved. What the heck is this all about? All databases must store information with reference keys. Not all databases speak the same key language. Resolver’s job is to take the identifier that you have and convert it to the identifier that you need. Bottom Line at Resolver: NO Look-up Strong. For example, the very popular Web structure drawing application JME (also JSME) creates SMILES. NIST allows spectrum lookup by InChI. But if you want to get the actual spectrum at NIST, you also need the CAS number that NIST uses for that particular compound. Yikes! Fortunately, NIST will give it to you if you ask nicely using the InChI - it turns out that they usually use the lowest CAS number. By the way, Google indexes InChI and IUPAC. InChIKey Identifier: We’re back to look-up, but this is look-up with power. InChI’s tend to be long. InChIKey’s are true calculated hashes of InChIKey. They are darn close to being truly unique. Unique is what you want in look-up. Google also indexes InChIKey. So your spread sheet “database” exercise uses Resolver because it’s easy to use, but it’s just a warm up for what’s to come. What might have been lost is this initial warm up exercise is that you were using a tool that is really not a database (Resolver). Rather it’s one of the keys for opening real databases. Otis

Brandon Davis (not verified) | Tue, 09/15/2015 - 17:44
In doing the assignment, i discovered more than one CAS for one chemical. Is it normal for chemicals to have multiple CAS registry numbers?

Stuart Chalk's picture
Stuart Chalk | Wed, 09/16/2015 - 07:45
So, the original indent was that CAS #'s were unique. However, as the list of chemicals given a CAS # grew (its now over 101 million) and the complexity of molecules got larger so chemicals (actually quite a few) were given additional CAS #'s. So, they are not unique but the InChI is...but more about that later in the course.

John House (not verified) | Tue, 09/15/2015 - 23:40
We have figured out a way to get the webservice function to work in older excel versions. I have uploaded a file created in 2013 excel. If using an older version, all you have to do is open my file with your version, clear all of the contents on my sheet, and paste your table. You will now be able to use the webservice function. Likewise, you can copy and paste my table into an older version and get the same results.

Jordi Cuadros's picture
Jordi Cuadros | Wed, 09/16/2015 - 03:07

Dear John,

Although I am not be able to check your approach right now (I will report back as soon as I have a virtual machine set up), it doesn't make sense to me. Could you post a screencast video showing us how you tested it?
In any case, I see two elements that may provoke a misinterpretation there: (1) an older file type is not the same as an older version of Excel (WEBSERVICE will work in Excel 2013 even if you save your file in an earlier format), (2) when you open a file that have some functions that not available in your current Excel version, the software may keep the values that were saved with Excel 2013; a manual calculation of the whole worksheet (F9) should make clear whether the function is working or not.

John House (not verified) | Wed, 09/16/2015 - 07:37
Absolutely, we should be able to get one by Thursday. We found that when opening files created in excel 2013, the functions still worked. We believe that the script for the function must be saved within the sheet itself.

Jordi Cuadros's picture
Jordi Cuadros | Thu, 09/17/2015 - 17:07

I just checked your spreadsheet in Excel 2010 32-bit on a Windows 7 virtual machine and it does not work for me (as expected). In the screenshot, you can see the error that result when the WEBSERVICE function is calculated there. Last rows have not been recalculated.

John House (not verified) | Thu, 09/17/2015 - 19:51
We did not properly vet this method before sharing. It worked on one person's device but we have since not been able to repeat it on any others. We are looking into it further and will update.

Brian Murphy | Wed, 09/16/2015 - 17:01
How would you cite sources for a spreadsheet?

Annotations