1. Projects involving MolView: Herman Bergwerf <firstname.lastname@example.org>
2. Acquisition of NIST IR and MS SVG Files for the CheMagic Model Kit Mini (MKM): Otis Rothenberger <email@example.com>
3. Using Wikipedia to Interface PubChem and ChemSpider for international users
4. A Chemistry Add-on for Libre/OpenOffice Calc
5. Activity: Analysis of text frequency in PubChem
1. Localizing IUPAC names: Herman Bergwerf <firstname.lastname@example.org>
I created MolView (http://molview.org) as a hobby project. MolView is a free and open-source application and works in your browser. The main goal of MolView is to make various chemical databases easily accessible. You can draw a structural formula and retrieve the 3D structure from the PubChem database by clicking 2D to 3D. You can also type a search term into the search box (top-left) and select one of the suggested records.
In this 'auto-complete' search box, you can also type IUPAC names. These names are matched against all connected databases in the same way as any other text string typed into this box (using string similarity algorithms). Of course, there should be a better way of doing this! And wouldn't it be awesome if users could also enter a systematic name (one that uses the same semantics as IUPAC names) in their own language? The goal of this project is to describe a method to do this and test its viability.
Complete list of goals
Below are three parts that could be used for completing this project.
Part 1, breaking down IUPAC names:
- Describe all IUPAC name semantics (prefixes, suffixes, sequence, ...), or at least a very significant subset. You could use a flowchart to break down the logic (http://www.draw.io/ is an excellent program for this).
- Try to create a translation table for IUPAC names for a non-English language (Spanish, French, German, Dutch, ...). It's best to pick a language you are proficient in.
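As an illustration of what such a translation table might look like, here is a minimal Python sketch for Spanish. The fragment tables STEMS and SUFFIXES and the function translate_name are hypothetical names made up for this example, and the tables cover only a handful of simple 'stem + suffix' names; a real table would need locants, multipliers, substituents and much more.

```python
# Hypothetical English -> Spanish fragment tables (assumption: tiny subset only).
STEMS = {"meth": "met", "eth": "et", "prop": "prop", "but": "but"}
SUFFIXES = {"ane": "ano", "ene": "eno", "yne": "ino", "anol": "anol"}


def translate_name(name):
    """Translate a simple 'stem + suffix' English name, or return None."""
    name = name.lower()
    # Try longer stems first so a longer stem wins if two stems ever overlap.
    for stem in sorted(STEMS, key=len, reverse=True):
        if name.startswith(stem):
            suffix = name[len(stem):]
            if suffix in SUFFIXES:
                return STEMS[stem] + SUFFIXES[suffix]
    # The name does not fit the 'stem + suffix' pattern covered by the tables.
    return None


if __name__ == "__main__":
    for english in ["methane", "ethanol", "ethyne", "benzene"]:
        print(english, "->", translate_name(english))
```

Returning None for names the tables cannot handle makes it easy to measure how much of a name list the table actually covers.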
Part 2, validate the translation method:
- Manually or programmatically (highly recommended) create a list of translated IUPAC names.
- Manually or programmatically (for example using Wikipedia) look up the actual translated IUPAC names.
Part 3, recognize and predict IUPAC names:
- When can you be sure the user is typing an IUPAC name?
- Create a method to find all possible successions (prefixes, suffixes) for a partial IUPAC name. This way the user can be presented with a list of suggestions for the next prefix/suffix.
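One possible way to approach the suggestion step is to treat a name as a sequence of fragments and record which kind of fragment may follow which. The Python sketch below does this for a deliberately tiny, hypothetical vocabulary; FRAGMENTS, FOLLOWERS and suggest are illustrative names, and the three categories are an assumption that falls far short of the real IUPAC rules.

```python
# Hypothetical fragment vocabulary, grouped by category (assumption: tiny subset).
FRAGMENTS = {
    "substituent": ["methyl", "ethyl", "chloro"],
    "stem": ["meth", "eth", "prop", "but"],
    "suffix": ["ane", "ene", "yne", "anol"],
}

# Which categories may follow which; "start" stands for the empty name.
FOLLOWERS = {
    "start": ["substituent", "stem"],
    "substituent": ["substituent", "stem"],
    "stem": ["suffix"],
    "suffix": [],
}


def suggest(partial):
    """Return a sorted list of fragments that could continue 'partial'."""
    results = set()

    def walk(rest, state):
        if rest == "":
            # Everything typed so far was consumed: offer all allowed followers.
            for category in FOLLOWERS[state]:
                results.update(FRAGMENTS[category])
            return
        for category in FOLLOWERS[state]:
            for fragment in FRAGMENTS[category]:
                if rest.startswith(fragment):
                    # A complete fragment was typed: continue after it.
                    walk(rest[len(fragment):], category)
                elif fragment.startswith(rest):
                    # 'rest' is an unfinished fragment: offer to complete it.
                    results.add(fragment)

    walk(partial.lower(), "start")
    return sorted(results)


if __name__ == "__main__":
    print(suggest("chloroeth"))
```

An empty result means the input cannot be read as a sequence of known fragments, which could also serve as a rough signal that the user is not typing a systematic name.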
Using Wikipedia to Interface PubChem and ChemSpider for international users, Jordi Cuadros <email@example.com>
Chemistry databases like PubChem and ChemSpider are increasingly important to chemists. However, they lack the capability to search by name in languages other than English. This work proposes to overcome this limitation by using the different language editions of Wikipedia as a translation tool from a localized chemical name to an English one. The latter will then be used to query PubChem, ChemSpider and similar databases by automatically creating appropriate URLs.
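The URL-building idea could be sketched in Python roughly as follows. This is only an outline under stated assumptions: it assumes the MediaWiki API's langlinks query is used to find the English article title and that PubChem's PUG REST name lookup is the query target. The function names are hypothetical, no network request is made here, and the response layout handled by english_title is the default JSON format as far as we know it.

```python
from urllib.parse import quote, urlencode


def langlinks_url(local_name, lang):
    """URL asking the 'lang' Wikipedia for the English title of an article."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": local_name,
        "lllang": "en",
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def english_title(api_response):
    """Pull the English title out of a decoded langlinks JSON response."""
    pages = api_response.get("query", {}).get("pages", {})
    for page in pages.values():
        for link in page.get("langlinks", []):
            if link.get("lang") == "en":
                return link.get("*")
    # No English language link was found for this article.
    return None


def pubchem_url(english_name):
    """PUG REST URL that looks up compound IDs by (English) name."""
    return ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
            f"{quote(english_name, safe='')}/cids/JSON")


if __name__ == "__main__":
    print(langlinks_url("Azijnzuur", "nl"))  # Dutch name for acetic acid
    print(pubchem_url("Acetic acid"))
```

Keeping URL construction separate from the actual HTTP requests makes the translation step easy to test offline and to reuse for other target databases.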
A Chemistry Add-on for Libre/OpenOffice Calc, Jordi Cuadros <firstname.lastname@example.org>
Worksheets are ubiquitous in the daily work of chemists. However, solutions for accessing chemical information from within them are not readily available, especially in free software environments. This project will start building a set of functions for Libre/OpenOffice Calc to ease access to existing public chemistry databases.
Activity: Analysis of text frequency in PubChem, Leah McEwen <email@example.com>
Outcome: generate synonym lists that can potentially be added to an ontology or classification system
Learning Objectives: Students will learn about computer interpretation, human variability and synonymy, using some quick-and-dirty approaches to analyzing variability in text.
Description: An aggregated database such as PubChem contains many properties whose units are embedded in text strings and inconsistently delineated. For example, reported data for melting point might appear with units such as C, deg, degrees, o, F, Fahrenheit, Celsius. This complicates further validation and analysis of the data. The text terms that appear most frequently in association with various properties can form the basis of synonym lists and improve the structure of the data through an ontology or classification system.
Tasks: the project may involve the following tasks, depending on prior skills and interests
1- ‘extract’ text terms by removing numbers and punctuation
2- analyze the frequency of terms and represent in a histogram
3- determine the equivalence of various terms and assign low level concept terms
4- repeat the process for 2-3 different properties, starting with a simple one to refine the steps
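For students who want to try scripting, tasks 1 to 3 could be sketched in Python along the following lines. The sample values and the SYNONYMS mapping are invented for illustration and are not taken from PubChem, and the function names are hypothetical; the histogram is drawn as plain text to avoid any plotting dependency.

```python
import re
from collections import Counter

# Made-up melting point strings (assumption: not real PubChem records).
SAMPLE = ["16.6 deg C", "62 F", "17 degrees Celsius", "16-17 C"]

# Hypothetical mapping from raw terms to low-level concept terms (task 3).
SYNONYMS = {
    "c": "celsius", "celsius": "celsius",
    "f": "fahrenheit", "fahrenheit": "fahrenheit",
    "deg": "degree", "degrees": "degree",
}


def extract_terms(value):
    """Task 1: drop numbers and punctuation, keep lowercase letter runs."""
    return re.findall(r"[^\W\d_]+", value.lower())


def term_frequencies(values):
    """Task 2: count how often each raw term occurs."""
    counts = Counter()
    for value in values:
        counts.update(extract_terms(value))
    return counts


def concept_frequencies(counts):
    """Task 3: merge raw terms into concepts; unknown terms stay as-is."""
    merged = Counter()
    for term, n in counts.items():
        merged[SYNONYMS.get(term, term)] += n
    return merged


if __name__ == "__main__":
    # Print a simple text histogram, most frequent concept first.
    for term, n in concept_frequencies(term_frequencies(SAMPLE)).most_common():
        print(f"{term:<12}{'#' * n}")
```

Terms that do not appear in the mapping are kept unchanged, so they show up in the histogram and can be reviewed by hand before being assigned to a concept.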
Skills: Students already familiar with macros or basic scripting (e.g., VBA, Python) can use these tools to generate histograms. Histograms can also be pre-created for analysis, as these skills have not been covered in the course to date. An alternative approach without scripting is to use Excel functions to count the frequency of variants of possible term roots, which can still generate useful synonym data.