10. Dictionary-Based Chemical Name-Structure Association Clean-Up Algorithm

Members : TBD

Mentor : Sunghwan Kim (PubChem/NCBI)


Project Description

During this course, you will encounter many cases where a chemical name does not match its structure in a chemical database. For example, some chemicals whose names contain the string "sodium" do not have any Na atoms, and some chemicals with "acetate" in their names do not have an acetate unit in their structures. This may sound strange, but such cases do exist in many chemical databases, and we will discuss this topic frequently during the course.

The proposed project is about developing a dictionary-based algorithm that identifies potentially incorrect chemical name-structure associations.



This algorithm will consist of the following steps.

  1. Generate a list of common chemical fragments' (or units') names and structures (in SMILES).
  2. For each chemical fragment (let's take "fumarate" as an example), repeat these steps:
    1. Retrieve all compounds whose names contain the string "fumarate". Because the retrieved compounds have "fumarate" in their names, they are expected to have a "fumarate" unit in their structures, too.
    2. Run a substructure search against the compounds retrieved in step 2.1, using the SMILES string for "fumarate" as the query. The resulting chemicals have the "fumarate" string in their names and the "fumarate" unit in their structures, so their name-structure associations are considered correct.
    3. Take the difference between the results of steps 2.1 and 2.2. The difference corresponds to the compounds whose names contain the "fumarate" string but whose structures lack the "fumarate" unit. Therefore, the name-structure associations for these compounds are potentially incorrect.
    4. Analyze the name-structure associations for the chemicals from step 2.3. This step is intended to identify exceptional cases (e.g., compounds whose names contain a string like "calcium-free" are not likely to have calcium in their structures, although the names contain the string "calcium").
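
The steps above can be sketched in a few lines of Python. This is only a minimal illustration under several assumptions, not a finished implementation: the `Compound` record, the function names, the tiny fragment dictionary, and the four sample records are all made up for the example. In particular, `naive_match` is a toy stand-in that merely checks whether the fragment SMILES occurs as text inside the compound SMILES. That works only for trivial single-atom fragments, and a real substructure search would need a cheminformatics toolkit (for instance RDKit's `Mol.HasSubstructMatch`) or a database search service.

```python
import re
from typing import Callable, Dict, Iterable, List, NamedTuple, Set, Tuple


class Compound(NamedTuple):
    """Hypothetical minimal record: identifier, one name, one SMILES."""
    cid: int
    name: str
    smiles: str


# Substructure test: (compound SMILES, fragment SMILES) -> bool
Matcher = Callable[[str, str], bool]


def naive_match(compound_smiles: str, fragment_smiles: str) -> bool:
    # ASSUMPTION: toy text containment, NOT a real substructure search.
    # Replace with a toolkit-based matcher for anything beyond single atoms.
    return fragment_smiles in compound_smiles


def find_suspects(compounds: Iterable[Compound], fragment_name: str,
                  fragment_smiles: str, matcher: Matcher) -> Set[Compound]:
    # Name search: compounds whose name contains the fragment name.
    name_hits = {c for c in compounds if fragment_name in c.name.lower()}
    # Substructure search, restricted to the name hits.
    structure_hits = {c for c in name_hits
                      if matcher(c.smiles, fragment_smiles)}
    # Difference: name says "fragment" but the structure lacks it.
    return name_hits - structure_hits


def split_exceptions(suspects: Set[Compound], fragment_name: str
                     ) -> Tuple[Set[Compound], Set[Compound]]:
    # Exception analysis, illustrated with a single hypothetical rule:
    # names such as "calcium-free" explain the missing fragment.
    pattern = re.compile(rf"\b{re.escape(fragment_name)}[- ]free\b",
                         re.IGNORECASE)
    explained = {c for c in suspects if pattern.search(c.name)}
    return suspects - explained, explained


def run(compounds: List[Compound], fragments: Dict[str, str],
        matcher: Matcher) -> Dict[str, List[int]]:
    report = {}
    for frag_name, frag_smiles in fragments.items():
        suspects = find_suspects(compounds, frag_name, frag_smiles, matcher)
        still_suspect, _explained = split_exceptions(suspects, frag_name)
        report[frag_name] = sorted(c.cid for c in still_suspect)
    return report


# Fragment dictionary (fragment name -> SMILES); illustrative entries only.
FRAGMENTS = {"sodium": "[Na+]", "calcium": "[Ca+2]"}

# Made-up sample records, not taken from any database.
SAMPLE = [
    Compound(1, "sodium chloride", "[Na+].[Cl-]"),
    Compound(2, "sodium acetate", "CC(=O)O"),  # structure lacks Na
    Compound(3, "calcium chloride", "[Ca+2].[Cl-].[Cl-]"),
    Compound(4, "calcium-free ethanol", "CCO"),  # explained by "-free"
]

report = run(SAMPLE, FRAGMENTS, naive_match)
print(report)
```

Passing the matcher in as a function keeps the set-difference logic independent of the toolkit, so the toy matcher can later be swapped for a proper substructure search without touching the rest of the sketch. Compounds set aside by the exception rule are kept separately rather than discarded, since they may still deserve a manual look.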

This project is fairly straightforward and requires only a little programming skill (which I think one can learn within a week or two), although building the list of chemical fragment names and structures would be somewhat tedious. However, it tackles a very important issue in cheminformatics, so I expect the resulting algorithm to be very practical.
