Discussion

Otis Rothenberger | Tue, 09/15/2015 - 17:57
John and Joshua, I’d like to make a few points about NIH/CADD Resolver that may be helpful. First, Resolver is not really a chemical data source. Its primary function is the interconversion of chemical identifiers. I’ll reference the identifiers name, IUPAC, InChI, InChIKey, CAS, SDF (2D and 3D), and SMILES in this note.
1) Name Identifier: This is the weakest identifier of all, and it is Resolver’s weak link. A given chemical compound can have a huge number of names. When Resolver is presented with an input name, it must use look-up. IUPAC is a different matter, but I’ll get to that in a minute. Bottom Line at Resolver: Look-up Weak.
2) CAS Identifier: This is almost as bad as name! Many compounds have more than one CAS number. Here are CAS numbers for ethanol from Resolver: 121182-78-3, 64-17-5, 8024-45-1, 8000-16-6, 68475-56-9, 71076-86-3, 71329-38-9. There are two reasons for this multiplicity: Chemical Formulations and Money (CAS sells these things!). Bottom Line at Resolver: Look-up Weak.
3) IUPAC, InChI, SMILES, SDF Identifiers: These identifiers are all exact codes for the structural atom connectivity of a molecule. OChem textbooks, of course, cover IUPAC, but IUPAC is only one member of this powerful group. With any of these as input, Resolver uses powerful algorithms to convert the input to any of the others that may be needed. No Look-up Involved. What the heck is this all about? All databases must store information with reference keys. Not all databases speak the same key language. Resolver’s job is to take the identifier that you have and convert it to the identifier that you need. Bottom Line at Resolver: NO Look-up Strong. For example, the very popular Web structure drawing application JME (also JSME) creates SMILES. NIST allows spectrum lookup by InChI. But if you want to get the actual spectrum at NIST, you also need the CAS number that NIST uses for that particular compound. Yikes! Fortunately, NIST will give it to you if you ask nicely using the InChI. It turns out that they usually use the lowest CAS number. By the way, Google indexes InChI and IUPAC.
4) InChIKey Identifier: We’re back to look-up, but this is look-up with power. InChIs tend to be long. InChIKeys are true calculated hashes of InChI. They are darn close to being truly unique. Unique is what you want in look-up. Google also indexes InChIKey.
So your spreadsheet “database” exercise uses Resolver because it’s easy to use, but it’s just a warm-up for what’s to come. What might have been lost in this initial warm-up exercise is that you were using a tool that is really not a database (Resolver). Rather, it’s one of the keys for opening real databases.
Otis
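For anyone who wants to try the identifier-conversion idea outside of Excel, here is a minimal Python sketch. It only builds Resolver URLs and picks the numerically lowest CAS number from the ethanol list above; it does not contact any server. The URL layout (`/chemical/structure/<identifier>/<representation>`) and the representation names `stdinchikey` and `cas` reflect my understanding of the Resolver service, and the helper names `resolver_url` and `lowest_cas` are made up for illustration. Reading "lowest" as "smallest number once the hyphens are removed" is also an assumption.

```python
from urllib.parse import quote

# Assumed base URL of the NCI/CADD Chemical Identifier Resolver.
RESOLVER_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str) -> str:
    """Build a URL asking Resolver to convert `identifier` to `representation`."""
    # Percent-encode spaces, parentheses and other reserved characters so the
    # identifier travels as a single path segment.
    return f"{RESOLVER_BASE}/{quote(identifier, safe='')}/{representation}"


def lowest_cas(cas_numbers):
    """Return the CAS number that is smallest once its hyphens are removed."""
    return min(cas_numbers, key=lambda cas: int(cas.replace("-", "")))


# CAS numbers for ethanol, as listed in the comment above.
ETHANOL_CAS = [
    "121182-78-3", "64-17-5", "8024-45-1", "8000-16-6",
    "68475-56-9", "71076-86-3", "71329-38-9",
]

print(resolver_url("CCO", "stdinchikey"))  # SMILES in, InChIKey requested
print(resolver_url("ethanol", "cas"))      # name in, CAS numbers requested
print(lowest_cas(ETHANOL_CAS))
```

The encoding step matters mostly for names, which can contain spaces and parentheses that a URL will not carry as typed.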

Brandon Davis (not verified) | Tue, 09/15/2015 - 17:44
In doing the assignment, I discovered more than one CAS number for one chemical. Is it normal for chemicals to have multiple CAS registry numbers?

John House (not verified) | Tue, 09/15/2015 - 17:02
I am using the Excel webservice method to convert compound names to CAS numbers. However, when attempting this with charged species, such as Chromium (III) sulfate, the chemical resolver cannot return a CAS number. Does anyone have any ideas?

Joshua Henrich (not verified) | Tue, 09/15/2015 - 17:00
Dr. Chalk, are there any other good chemical identifier resolver sites (besides cactus.nci.nih.gov)? When my partner and I write an Excel function to generate a URL, there are a good number of chemical compounds for which the nci.nih.gov site cannot find CAS or molecular weight data.

Otis Rothenberger | Tue, 09/15/2015 - 08:20
Bob, thanks for your Excel video - very nice. This is particularly useful with Resolver. To my knowledge, Resolver does not blacklist. Students, be careful if you make mass PubChem downloads: PubChem will blacklist an offending IP address. This is a bigger problem for server managers. For this reason, PubChem recommends that web apps download directly to the browser rather than via a server proxy. Still, users of these browser apps need to know that it's possible to get PubChem upset with your IP address because of a heavy hit for data in a short period of time. SDBS is an example of a wonderful spectral data source that also had to enforce some restrictions because of mass data collection: <a href="http://sdbs.db.aist.go.jp/sdbs/cgi-bin/cre_index.cgi">http://sdbs.db.aist.go.jp/sdbs/cgi-bin/cre_index.cgi</a>
Jennifer, I'm also a Mac user. iWork Numbers is a nice little spreadsheet for hand entry and routine tasks, but automation is all but gone in recent versions. Still, I use it for all my routine tasks. For students using Macs, Google Sheets has four data import functions. They are all URL-based. Their names are fairly self-explanatory: IMPORTDATA, IMPORTFEED, IMPORTHTML, IMPORTXML. I agree that it does not hurt to have MS experience on a resume, but there are other options available. For me, life after Windows (I'm retired.) is carefree and happy with Mac!
Regards, Otis
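One way to avoid the "heavy hit in a short period" problem is to space requests out. Below is a rough Python sketch of that idea. The function name `fetch_politely` and the half-second default are assumptions for illustration, not a PubChem rule, so check the service's current usage policy for the real limit. The PubChem URL pattern in the demo reflects my understanding of PUG REST, and the stand-in `fetch` function means nothing is actually downloaded.

```python
import time


def fetch_politely(urls, fetch, min_interval=0.5,
                   sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, starting calls at least min_interval seconds apart."""
    results = []
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever remains of the interval since the last request began.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function, so no network traffic is generated.
names = ["ethanol", "methanol", "acetone"]
urls = [
    f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{n}/property/MolecularWeight/TXT"
    for n in names
]
for line in fetch_politely(urls, fetch=lambda u: f"would fetch {u}", min_interval=0.01):
    print(line)
```

To make real requests, pass a function that downloads the URL (for example one built on `urllib.request.urlopen`) as `fetch`, and keep the interval generous.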

Stuart Chalk | Tue, 09/15/2015 - 08:03
Metadata is the term used to describe information that contextualizes other information, or data about data. It sounds very abstract but actually makes sense if you think about an example. Consider the data in the citation managers in the last module. If you consider the content of a scientific paper 'data', then the metadata for it are its descriptors: title, author(s), journal, volume, issue, pages, year, author address, etc. The metadata is the information that characterizes a piece of data and, as a consequence, allows you to search for it in a database. Metadata as a term comes out of the library community, and it's everywhere. Think about searches you do online at amazon.com, for instance. When you search for, say, a thumb drive, it will show you items you can buy and present you with options on the left-hand side that you can refine the search by: size, brand, color, special features. All of these are metadata about the different thumb drives that allow you to narrow down the search.
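The thumb drive example can be sketched in a few lines of Python. This is only an illustration: the `ThumbDrive` class, the `refine` helper and the catalog entries are all made up, but they show how metadata fields (brand, size, color) are what a search actually filters on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThumbDrive:
    # Each field is a piece of metadata describing the item.
    name: str
    brand: str
    size_gb: int
    color: str


# Hypothetical catalog entries, invented for this example.
CATALOG = [
    ThumbDrive("Drive A", "Acme", 32, "black"),
    ThumbDrive("Drive B", "Acme", 64, "red"),
    ThumbDrive("Drive C", "Globex", 64, "black"),
]


def refine(items, **criteria):
    """Keep only the items whose metadata matches every given criterion."""
    return [
        item for item in items
        if all(getattr(item, field) == value for field, value in criteria.items())
    ]


print([d.name for d in refine(CATALOG, size_gb=64)])
print([d.name for d in refine(CATALOG, size_gb=64, color="black")])
```

Each extra criterion plays the role of one more checkbox on the left-hand side of the search page.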

Stuart Chalk | Tue, 09/15/2015 - 07:43
Sadly, MS does not make an equivalent to MS Access for the Mac. Apple has Numbers, and Apple's subsidiary FileMaker has a good database. If you want more of a simple relational database, you might try the open source 'DB Browser for SQLite' (<a href="http://sqlitebrowser.org/">http://sqlitebrowser.org/</a>). For me personally there is no substitute for MySQL (<a href="http://dev.mysql.com/downloads/mysql/">http://dev.mysql.com/downloads/mysql/</a>), as the Community Server edition is free (no support from Oracle, but there is great online documentation and a wealth of knowledge on the web about how to use it). The downside to this is that it requires a webserver; however, you can install all you need if you use MAMP (<a href="https://www.mamp.info/">https://www.mamp.info/</a>), which installs open source versions of Apache, MySQL, PHP and phpMyAdmin (a web interface for MySQL written in PHP) on either Mac or PC. It is awesome!
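To give a feel for what "relational" means here, the following is a small sketch using Python's built-in `sqlite3` module, so nothing needs to be installed. The table and column names are invented for illustration. It stores one compound (ethanol, with its standard InChIKey) and links it to the several CAS numbers that Resolver lists for it elsewhere in this thread.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE compound (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        inchikey TEXT NOT NULL UNIQUE
    );
    CREATE TABLE cas_number (
        cas         TEXT PRIMARY KEY,
        compound_id INTEGER NOT NULL REFERENCES compound(id)
    );
""")

# One compound row ...
conn.execute(
    "INSERT INTO compound (id, name, inchikey) VALUES (?, ?, ?)",
    (1, "ethanol", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
)

# ... linked to many CAS rows (a one-to-many relationship).
ethanol_cas = [
    "121182-78-3", "64-17-5", "8024-45-1", "8000-16-6",
    "68475-56-9", "71076-86-3", "71329-38-9",
]
conn.executemany(
    "INSERT INTO cas_number (cas, compound_id) VALUES (?, 1)",
    [(cas,) for cas in ethanol_cas],
)

# Join the two tables to count CAS numbers per compound.
rows = conn.execute("""
    SELECT c.name, COUNT(n.cas)
    FROM compound AS c
    JOIN cas_number AS n ON n.compound_id = c.id
    GROUP BY c.name
""").fetchall()
print(rows)
```

A file created this way (pass a filename instead of `":memory:"`) can be opened in DB Browser for SQLite.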

Stuart Chalk | Tue, 09/15/2015 - 07:27
I think as long as the activity is completed correctly in your spreadsheet of choice then, by me, it's fine (but consult with your facilitator). I think, though, that in industrial chemistry laboratories, companies are likely to be all MS products and not open source, as the IT folk will likely require them. So, being able to put 'proficient in MS software' on your resume is a selling point. (I also think that highlighting this course on your resume will get you an even better chance at a job :) )

Stuart Chalk | Tue, 09/15/2015 - 07:20
I can only speculate as to the reason for this. If you are trying to copy and paste a table directly from the webpage, my guess is that there are some special characters in the webpage that Excel does not like. If you save the webpage as a plain text file, then I think you will have more luck. If this does not help, send me an email and I will troubleshoot and post the answer.

Joshua Henrich (not verified) | Tue, 09/15/2015 - 00:59
Dr. Chalk, what exactly is metadata? Hopefully, it's not too basic a question. I'm seeing most of these terms for the first time.