Potential OLCC Project with Mentors

Students, click on the above title and you will get information on some potential projects we have mentors for.

1. Herman Bergwerf: Projects involving MolView
2. Otis Rothenberger: Projects involving CheMagic Model Kit Mini (MKM)
3. Jordi Cuadros: Using Wikipedia to Interface PubChem and ChemSpider for international users
4. Jordi Cuadros: A Chemistry Add-on for Libre/OpenOffice Calc
5. Leah McEwen: Analysis of text frequency in PubChem

1. Projects involving MolView: Herman Bergwerf <hermanbergwerf@gmail.com>
2. Acquisition of NIST IR and MS SVG Files for the CheMagic Model Kit Mini (MKM): Otis Rothenberger <osrothen@icloud.com>
3. Using Wikipedia to Interface PubChem and ChemSpider for international users
4. A Chemistry Add-on for Libre/OpenOffice Calc
5. Activity: Analysis of text frequency in PubChem

1. Localizing IUPAC names: Herman Bergwerf <hermanbergwerf@gmail.com>

I created MolView (http://molview.org) as a hobby project. MolView is a free and open-source application and works in your browser. The main goal of MolView is to make various chemical databases easily accessible. You can draw a structural formula and retrieve the 3D structure from the PubChem database by clicking 2D to 3D. You can also type a search term into the search box (top-left) and select one of the suggested records.

In this 'auto-complete' search box, you can also type IUPAC names. These names are matched against all connected databases in the same way as any other text string typed into this box (using string similarity algorithms). Of course there should be a better way of doing this! And wouldn't it be awesome if users could also enter a systematic name (one that uses the same semantics as IUPAC names) in their own language? The goal of this project is to describe a method to do this and test its viability.

Complete list of goals

Below are three parts that could be used for completing this project.

Part 1, breaking down IUPAC names:

Describe all IUPAC name semantics (prefixes, suffixes, sequence, ...), or at least a very significant subset. You could use a flowchart to break down the logic (http://www.draw.io/ is an excellent program for this).
Try to create a translation table for IUPAC names for a non-English language (Spanish, French, German, Dutch, ...). It's best to pick a language you are proficient in.

Part 2, validate the translation method:

Manually or programmatically (highly recommended) create a list of translated IUPAC names.
Manually or programmatically (for example using Wikipedia) look up the actual translated IUPAC names.
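The translation and validation steps in Parts 1 and 2 could be prototyped along the following lines. This is only a minimal sketch: the English-to-Spanish fragment table `EN_TO_ES`, the function names, and the reference pairs are all illustrative assumptions, and the table covers just a handful of fragments. The greedy longest-match split works for these examples but would need backtracking for a realistic fragment set.

```python
# Toy English -> Spanish fragment table (hypothetical, far from complete).
EN_TO_ES = {
    "meth": "met", "eth": "et", "prop": "prop", "but": "but",  # carbon-count stems
    "an": "an", "en": "en",                                    # bond type
    "e": "o", "ol": "ol",                                      # endings
    "yl": "il",                                                # substituent suffix
}


def split_fragments(name, table):
    """Greedy longest-match split; raises ValueError on an unknown fragment."""
    keys = sorted(table, key=len, reverse=True)
    fragments = []
    i = 0
    while i < len(name):
        if not name[i].isalpha():
            fragments.append(name[i])  # locants, hyphens, commas pass through
            i += 1
            continue
        for key in keys:
            if name.startswith(key, i):
                fragments.append(key)
                i += len(key)
                break
        else:
            raise ValueError(f"no known fragment at position {i} in {name!r}")
    return fragments


def translate(name, table=EN_TO_ES):
    """Translate a name fragment by fragment using the table."""
    return "".join(table.get(f, f) for f in split_fragments(name.lower(), table))


def mismatches(reference, table=EN_TO_ES):
    """Return (english, predicted, actual) for every name the table gets wrong."""
    return [(en, translate(en, table), actual)
            for en, actual in reference.items()
            if translate(en, table) != actual]


# Hand-entered reference pairs; in Part 2 these would be looked up, e.g. on Wikipedia.
REFERENCE = {"methanol": "metanol", "ethane": "etano", "2-methylpropane": "2-metilpropano"}
print(mismatches(REFERENCE))
```

An empty mismatch list means the table reproduces every reference name; any entries that do appear point to fragments or rules that the table is still missing.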

Part 3, recognize and predict IUPAC names:

When can you be sure the user is typing an IUPAC name?
Create a method to find all possible successions (prefixes, suffixes) for a partial IUPAC name. This way the user can be presented with a list of suggestions for the next prefix/suffix.
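One possible way to approach the suggestion step is a small state machine over name fragments. The sketch below assumes a deliberately tiny, hypothetical grammar (stem, then bond type, then ending); `GRAMMAR` and `suggest` are made-up names, and the grammar accepts some chemically meaningless names because it does not check valence or chain length.

```python
# Hypothetical toy grammar: each state lists the fragments allowed next
# and the state reached after one of them has been consumed.
GRAMMAR = {
    "start": (["meth", "eth", "prop", "but"], "stem"),
    "stem": (["an", "en", "yn"], "bond"),
    "bond": (["e", "ol", "al", "one", "oic acid"], "end"),
}


def suggest(partial):
    """Return fragments that could continue `partial`.

    Returns [] when the name is already complete and None when the text
    does not fit the grammar (i.e. it is probably not a systematic name).
    """
    state, rest = "start", partial.lower()
    while state in GRAMMAR:
        options, next_state = GRAMMAR[state]
        consumed = next((o for o in options if rest.startswith(o)), None)
        if consumed is None:
            # No whole fragment fits: offer the ones the remaining text begins.
            candidates = [o for o in options if o.startswith(rest)]
            return candidates or None
        rest = rest[len(consumed):]
        state = next_state
    return [] if rest == "" else None


print(suggest("ethan"))
print(suggest("ethano"))
print(suggest("xyz"))
```

A `None` result gives one rough answer to the first question in Part 3: as long as the typed text still fits the grammar, it may be a systematic name. This simple version relies on no fragment being a prefix of another fragment in the same state, which will not hold for a complete fragment list.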

2. Acquisition of NIST IR and MS SVG Files for the CheMagic Model Kit Mini (MKM): Otis Rothenberger <osrothen@icloud.com>

The CheMagic Model Kit Mini (MKM) is not a research tool. It’s the virtual equivalent of a classical classroom/student model kit. Since the SMILES of any model constructed is always known, the MKM is also a data query tool. For pedagogic reasons, the acquisition of NIST IR and MS SVG files is quite useful. NIST, however, is not set up for easy direct access to spectra. Using a proxy server, it can be done, but it's a laborious process involving multiple proxy queries to NIST web pages - not databases, webpages!

My project suggestion is the development of a webpage that uses JavaScript only to access NIST SVG spectra directly. The page that I envision would process an InChIKey query and rely on an advanced Google search to go directly to NIST images:

https://www.google.com/search?as_st=y&tbm=isch&as_q=NIST+IR+MS&as_epq=LFQSCWFLJHTTHZ-UHFFFAOYSA-N&as_oq=&as_eq=&imgsz=&imgar=&imgc=&imgcolor=&imgtype=&cr=&as_sitesearch=&safe=images&as_filetype=&as_rights=

I’m sure the above query can be improved. I also realize that a research-oriented project would seek JCAMP data. This project, however, is about pedagogy - in this case a quick common-compound spectrum acquisition that can be easily incorporated into the MKM built-in student/teacher notebook.
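To illustrate how the advanced-search URL shown earlier might be assembled from an InChIKey, here is a minimal sketch in Python (the production page would do the same in JavaScript). The function name `nist_image_search_url` is hypothetical, and the sketch assumes that the empty parameters in the original URL can simply be left out, which has not been verified against Google.

```python
import re
from urllib.parse import urlencode

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def nist_image_search_url(inchikey):
    """Build a Google image-search URL for NIST IR/MS hits on an InChIKey."""
    if not INCHIKEY_PATTERN.match(inchikey):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    params = {
        "as_st": "y",
        "tbm": "isch",            # image search
        "as_q": "NIST IR MS",     # all of these words
        "as_epq": inchikey,       # this exact phrase
        "safe": "images",
    }
    return "https://www.google.com/search?" + urlencode(params)


print(nist_image_search_url("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
```

Checking the key format before building the URL avoids sending truncated or mistyped keys to the search engine.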

I should mention that some JavaScript fiddling of the Google Image URLs will be required - e.g. 2-octanone:

From
http://webbook.nist.gov/cgi/cbook.cgi?Spec=C111137&Index=0&Type=Mass
To
http://webbook.nist.gov/cgi/cbook.cgi?Spec=C111137&Index=0&Type=Mass&Large=on&SVG=on
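The URL adjustment above amounts to adding two query parameters. A minimal sketch of that rewrite, shown in Python for illustration (the function name `to_large_svg` is an assumption; the page itself would do this in JavaScript):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_large_svg(url):
    """Add Large=on and SVG=on to a NIST WebBook spectrum URL."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))  # keeps the original parameter order
    query["Large"] = "on"
    query["SVG"] = "on"
    return urlunsplit(parts._replace(query=urlencode(query)))


print(to_large_svg("http://webbook.nist.gov/cgi/cbook.cgi?Spec=C111137&Index=0&Type=Mass"))
```

Because the parameters are set by name rather than appended as text, applying the function to an already rewritten URL leaves it unchanged.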

I envision the page as incorporating the Google images response in an iframe with appropriate Google/NIST citations. If a production page emerges from this suggested project, I would be happy to host it on CheMagic with appropriate credit to student authors. CheMagic is a completely free chemistry resource site supported by CheMagic and Illinois State University - chemagic.com/molecules/mini.html

Required Student Skills - basic web page development
Useful Skills - knowledge of AJAX, JavaScript, jQuery

At least one student in the group would need the Useful Skills.

3. Using Wikipedia to Interface PubChem and ChemSpider for international users: Jordi Cuadros <jordi.at.iqs@gmail.com>
Chemistry databases like PubChem and ChemSpider are increasingly important to chemists. However, they lack the capability to search by name in any language. This work proposes to overcome this limitation by using the international Wikipedias as a translation tool from a localized chemical name to an English one. The latter will then be used to query PubChem, ChemSpider... by automatically creating appropriate URLs.
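The URL-building part of this idea could look roughly like the sketch below. It assumes the MediaWiki `langlinks` query and the PubChem PUG REST name lookup as the two endpoints; the function names are hypothetical, no network request is made, and the JSON layout handled by `english_title` is an assumption about the default response format that should be checked against a real response. ChemSpider is left out of the sketch.

```python
from urllib.parse import quote, urlencode


def langlinks_url(local_name, lang):
    """URL asking the `lang` Wikipedia for the English title of an article."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": local_name,
        "lllang": "en",
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


def english_title(response):
    """Pull the English title out of a parsed langlinks response (assumed layout)."""
    for page in response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            if link.get("lang") == "en":
                return link.get("*")
    return None  # no English article linked


def pubchem_url(english_name):
    """PubChem PUG REST URL that looks up compound IDs by name."""
    return ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
            + quote(english_name, safe="") + "/cids/JSON")


# Hand-written sample in the assumed response layout, not real API output.
sample = {"query": {"pages": {"1": {"title": "Agua",
                                     "langlinks": [{"lang": "en", "*": "Water"}]}}}}
print(langlinks_url("Agua", "es"))
print(pubchem_url(english_title(sample)))
```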

4. A Chemistry Add-on for Libre/OpenOffice Calc: Jordi Cuadros <jordi.at.iqs@gmail.com>
Worksheets are ubiquitous in the daily work of chemists. However, solutions to access chemical information are not readily available, especially in free software environments. This project will start building a set of functions for Libre/OpenOffice Calc to ease access to existing public chemistry databases.

5. Activity: Analysis of text frequency in PubChem: Leah McEwen <lrmcewen1@gmail.com>

Outcome: generate synonym lists that can be potentially added to an ontology or classification system

Learning Objectives: Students will learn about computer interpretation, human variability, and synonymy, using some quick and dirty approaches to analyzing variability in text.

Description: In an aggregated database such as PubChem, there are many properties where the units are contained in text strings and inconsistently delineated. For example, reported data for melting point might appear with units such as C, deg, degrees, o, F, Fahrenheit, Celsius. This complicates further validation and analysis of the data. The text terms that appear most frequently associated with various properties can form the basis of synonym lists and improve the structure of the data through an ontology or classification system.

Tasks: the project may involve the following tasks, depending on prior skills and interests
1- ‘extract’ text terms by removing numbers and punctuation
2- analyze the frequency of terms and represent in a histogram
3- determine the equivalence of various terms and assign low level concept terms
4- conduct process for 2-3 different properties, starting with a simple one to refine steps
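For students who do want to script tasks 1-3, a minimal Python sketch might look like this. The sample values, the `SYNONYMS` mapping, and the function names are all made up for illustration; real PubChem property strings would need to be downloaded first and will be messier.

```python
import re
from collections import Counter

# Hypothetical melting/boiling point strings, not actual PubChem data.
SAMPLES = ["-114.1 deg C", "173 degrees F", "-114 C", "159 K",
           "-173.5 deg. F", "-114.1 Celsius"]

# Hand-built synonym list mapping each raw term to a low-level concept term.
SYNONYMS = {"c": "celsius", "celsius": "celsius",
            "f": "fahrenheit", "fahrenheit": "fahrenheit",
            "k": "kelvin", "deg": "degree", "degrees": "degree"}


def extract_terms(value):
    """Task 1: drop numbers and punctuation, keep lowercase text terms."""
    return re.sub(r"[^A-Za-z\s]", " ", value).lower().split()


def term_counts(values):
    """Task 2: frequency of every raw term across all values."""
    return Counter(term for value in values for term in extract_terms(value))


def concept_counts(counts, synonyms):
    """Task 3: merge equivalent terms under one concept term."""
    merged = Counter()
    for term, n in counts.items():
        merged[synonyms.get(term, term)] += n
    return merged


raw = term_counts(SAMPLES)
for term, n in raw.most_common():
    print(f"{term:10s} {'#' * n}")  # crude text histogram
print(concept_counts(raw, SYNONYMS))
```

Terms that are missing from the synonym list are kept unchanged, so unexpected spellings show up in the merged counts and can be added to the list on the next pass.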

Skills: Students already familiar with macros or basic scripting (e.g., VBA, Python) can use these tools to generate histograms. Histograms can also be pre-created for analysis, as these skills have not been covered in the course to date. An alternative approach without scripting is to use Excel functions to count the frequency of variants of possible term roots, which can still generate useful synonym data.

Comments 2

Non-scripting Alternative

I can suggest a variation of my original proposal that is definitely less technical, yet it could lead to a useful end product (an informative paper). Over the years there have been a number of articles, papers, and blogs on the subject of Google as a possible serious chemical data tool - Henry Rzepa, Rich Apodaca, Egon Willighagen, and Christopher Southan to name a few. Using the last contributor’s paper "InChI in the wild: an assessment of InChIKey searching in Google" as a model, it might be interesting for a group of students to reinvestigate this 2013 paper with Google and possibly Wikipedia in mind: <a href="http://www.jcheminf.com/content/5/1/10">http://www.jcheminf.com/content/5/1/10</a>

I will put up a quick and dirty version (later today) of the Mini Model Kit that illustrates the technical concept (original proposal). The version will have two buttons - one going directly to NIST and the other going to NIST via Google for specific NIST info. This would just be an example of the more general idea of Google as a possible serious tool, i.e. the main theme of the suggested non-technical project variation.

Things for students to explore:
1) Advanced Google searches in general - not just for NIST images
2) Persistence of Google search results and resulting provenance problems
3) Possible ways of dealing with item 2
4) The problem of query truncation, particularly re InChI
5) InChI vs InChIKey
6) Other popular search engine responses to InChI and InChIKey queries
7) Advanced search capability of various popular engines

I’m sure more points would come up in the study. I'll post a short note here when I have the Mini Model Kit modification running.

Best, Otis

Non-scripting Alternative Project Follow Up

I put the promised quick and dirty Google/NIST version of the Model Kit Mini up. If you've been to the page recently, clear your browser cache: <a href="http://chemagic.com/molecules/mini.html">http://chemagic.com/molecules/mini.html</a>

Have some fun with it:
1) Click the O atom button and click a methane H to make methanol.
2) Click the NIST Google button and the subsequent yellow link. That's Google filtering the NIST results for the model in the window - filtering for spectra.
3) Click the NIST Direct button and the subsequent yellow link. That's the direct link to the related NIST page for the model in the window.

Cute (probably overly cute) title for a paper related to this type of Google use: "Google as Webpage Scraper". Google is the biggest webpage scraper in the world! If you Google the above title, you'll get a ton of hits related to apps and extensions - not hits about using Google for what it is - a webpage scraper.

Best, Otis
