Discussion | DivCHED CCCE: Cheminformatics OLCC

greetings

My name is Chandler, and I'm a student at St. Louis College of Pharmacy, where I'm going for a dual Bachelor of Health Sciences and Pharm.D. I decided to take this class to further my knowledge of programming, as I was big on programming in high school.

Milind Khadilkar (Mumbai, India. Cheminformatics enthusiast)

Hi all,
Thanks, Prof. Belford for your continued leadership.
I am a software-mathematics professional who got re-introduced to Chemistry during the early part of my programming career in the 1980s through some excellent books (most notably, 107 Stories in Chemistry) and associating with my former collegemates who went on to major in Chemistry. I became professionally involved with cheminformatics at multiple points as a software consultant, and have a continuing interest in it. I have taught cheminformatics informally to both chemists and software developers, and have written software that utilized InChI while InChI was yet to be released. My drawback is that I know little chemistry and have no personal acquaintance with equipment and chemicals and cannot distinguish between, say, H2O and NaCl by their looks. I usually use Python for programming and am exploring the associated scientific packages that loosely make up Scientific Python.
I have been following these OLCC/DivCHED initiatives for quite a few years, usually passively. I hope to learn more through this edition. Thanks and best wishes.

Project: chemical name-structure association clean-up algorithm

During this course, you will encounter many cases where a chemical name does not match its structure in a chemical database. For example, some chemicals whose names contain the string "sodium" do not have any Na atoms, and some chemicals with "acetate" in their names do not have an acetate unit in their structures. This may sound odd, but such cases do exist in many chemical databases, and we will discuss this topic frequently during the course.

The proposed project is about developing a dictionary-based algorithm that identifies potentially incorrect chemical name-structure associations. This algorithm will consist of the following steps.

(1) Generate a list of common chemical fragments' (or units') names and structures (in SMILES).

(2) For each chemical fragment (let's take "fumarate" as an example), repeat these:

(2a) Retrieve all compounds whose names contain the string "fumarate". Because the retrieved compounds have "fumarate" in their names, they are expected to have a "fumarate" unit in their structures, too.

(2b) Run a substructure search against the compounds retrieved from (2a), using the SMILES string for "fumarate" as a query. The resulting chemicals have the "fumarate" string in their name and the "fumarate" unit in their structure, so the name-structure associations are considered to be correct.

(2c) Take the difference between the results of (2a) and (2b). The difference corresponds to the compounds whose names contain the "fumarate" string but whose structures do not contain the "fumarate" unit. Therefore, the name-structure associations for these compounds are potentially incorrect.

(2d) Analyze the name-structure associations for the chemicals from (2c). This step is intended to identify exceptional cases (e.g., a compound whose name contains a string like "calcium-free" is not likely to have calcium in its structure, although the name contains the string "calcium").
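
To make steps (1) through (2d) a bit more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the fragment dictionary, the three toy records, and the function names are hypothetical. In particular, `naive_contains` is only a plain substring test on the SMILES text, not a real substructure search; a real project would plug in a cheminformatics toolkit or a database search service at that point.

```python
from typing import Callable, Dict, List, Tuple

# Step (1): hypothetical fragment dictionary, fragment name -> SMILES query.
FRAGMENTS: Dict[str, str] = {"sodium": "[Na+]"}

# Step (2d): name patterns that negate a fragment (assumed list, extend as needed).
EXCEPTION_SUFFIXES = ("-free",)

# Toy records (id, name, SMILES). Record "C" is deliberately mismatched.
RECORDS: List[Tuple[str, str, str]] = [
    ("A", "sodium chloride", "[Na+].[Cl-]"),
    ("B", "sodium-free salt substitute", "[K+].[Cl-]"),
    ("C", "sodium acetate", "CC(=O)O"),
]


def naive_contains(smiles: str, query: str) -> bool:
    """Toy stand-in for a substructure search: a plain substring test."""
    return query in smiles


def flag_mismatches(
    records: List[Tuple[str, str, str]],
    fragment: str,
    query: str,
    has_substructure: Callable[[str, str], bool] = naive_contains,
) -> Tuple[List[str], List[str]]:
    """Return (suspect ids, likely-exception ids) for one fragment."""
    by_id = {rid: (name.lower(), smiles) for rid, name, smiles in records}
    # (2a) compounds whose name contains the fragment string
    name_hits = {rid for rid, (name, _) in by_id.items() if fragment in name}
    # (2b) of those, compounds whose structure contains the fragment
    structure_hits = {
        rid for rid in name_hits if has_substructure(by_id[rid][1], query)
    }
    # (2c) name matches without a structure match
    difference = name_hits - structure_hits
    # (2d) set aside names like "sodium-free" as likely exceptions
    exceptions = {
        rid
        for rid in difference
        if any(fragment + suffix in by_id[rid][0] for suffix in EXCEPTION_SUFFIXES)
    }
    return sorted(difference - exceptions), sorted(exceptions)


if __name__ == "__main__":
    for fragment, query in FRAGMENTS.items():
        suspects, exceptions = flag_mismatches(RECORDS, fragment, query)
        print(fragment, "suspects:", suspects, "exceptions:", exceptions)
```

Passing the matcher in as a function keeps the set logic of (2a) to (2c) separate from the chemistry, so the toy matcher can be swapped out later without touching the rest.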

This project is pretty straightforward and needs only a little bit of programming skill (which I think one can learn within a week or two), although making a chemical fragment name-structure list would be somewhat tedious. However, it tackles a very important issue in cheminformatics, so I expect the resulting algorithm to be very practical.

Best,

Sunghwan,

Please subscribe to the Student Projects section

Hello, everyone.

I'm Sunghwan at PubChem. I've already introduced myself through this comment section, so please see my previous post below from last Monday.

By the way, I would like to ask you all to subscribe to the updates & comments for the Student Projects section (http://olcc.ccce.divched.org/Spring2017OLCCStudentProjects). As you may know, students are required to work on projects at the end of this course (likely in April - May, depending on each school's schedule). However, a semester is not very long, and if students start looking for a potential project in late March or April, they may not have enough time to work on the project.

Therefore, I strongly recommend that you start thinking about your projects as early as possible. So, I suggest that all of you subscribe to the updates and comments for the Student Projects section (http://olcc.ccce.divched.org/Spring2017OLCCStudentProjects) now. I (and hopefully other faculty members and students, too) will post several ideas to the section, so that we can start some discussion about them. Of course, I strongly encourage students to post their own ideas so that we can provide some advice/insight/help about your projects.

Best,

Sunghwan,

Please post potential projects on the student projects section

Dear Professor Bucholtz,

I am also looking forward to working with you and all the others during this course. By the way, I would like to ask you to post your ideas about potential student projects to the Student Projects section (http://olcc.ccce.divched.org/Spring2017OLCCStudentProjects), too. Because one semester is not very long, we may not have enough time to work on student projects (of course, depending on the nature of the projects and the students' backgrounds) if we start thinking about potential projects in late March or April. So, I think it is better to start discussing now what projects to work on. This will also help students pay attention to what skill sets they will need for their projects as the course proceeds.

Best,

Sunghwan,

Re: Question on Data Structure

Hi Vince,

To expound a bit further on the idea of systematic definition, this generally means that rules govern the types of data that are included, how they are labeled and how they relate to each other. This is different from each instance of data being managed randomly or arbitrarily. Using these rules allows data programmers to assign different properties to the data that can be used later for meaningful analysis. All new data coming into a systematically defined system is subject to these rules and thus can flow automatically into the data structure and be used immediately for further functions. The critical thing for automated processing is to define these rules explicitly and make sure they will work in all intended cases, so that they can be programmed into an algorithm that lets a computer process millions of potential compounds automatically.

So for example, if a certain symbol in a linear notation string is defined by the notation rules to indicate an aromatic carbon, then a programmer can build an algorithm that will automatically assign this carbon to an aromatic ring whenever this symbol comes up. The Cahn-Ingold-Prelog (CIP) rules for R/S notation are another example of systematic definition. However, most computer chemical representation standards use different rules for determining parity, so further algorithms are required for computer systems to ascertain chirality from these representations. As Evan suggests, chemical representation standards are continuously evolving!
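
As a small illustration of the aromatic-carbon example, here is a Python sketch. It assumes SMILES as the linear notation, where a lowercase `c` denotes an aromatic carbon. The function name and the simplified tokenizer are my own for illustration; this is not a full SMILES parser, and it only locates the aromatic carbon symbols rather than assigning them to specific rings.

```python
import re
from typing import List

# Tokens: a whole bracket atom, a two-letter halogen, or any single character.
TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|.")
# Inside brackets: optional isotope digits, then the element symbol.
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def aromatic_carbon_positions(smiles: str) -> List[int]:
    """Return the string offsets of atoms written as aromatic carbon."""
    positions = []
    for match in TOKEN.finditer(smiles):
        token = match.group()
        if token == "c":
            # Bare lowercase c: aromatic carbon by the notation rule.
            positions.append(match.start())
        elif token.startswith("["):
            # Bracket atom such as [cH-]; [Sc] (scandium) must not count.
            symbol = BRACKET_SYMBOL.match(token)
            if symbol and symbol.group(1) == "c":
                positions.append(match.start())
    return positions


if __name__ == "__main__":
    for smiles in ("c1ccccc1", "C1CCCCC1", "Clc1ccccc1"):
        print(smiles, len(aromatic_carbon_positions(smiles)))
```

Because the rule is explicit (lowercase `c` outside or inside brackets), the same few lines work on any number of input strings without a person looking at each one.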

I hope this helps!

Best,
Leah

Potential Project: Biology underlying legal highs

This potential project aims to:

(1) provide an overview of biology underlying controlled substances (i.e., illegal drugs),
(2) identify potential "legal highs" that have not been regulated by the authority (e.g., U.S. Drug Enforcement Administration) based on PubChem's bioactivity data and structural similarity to known controlled substances, and
(3) review what legislative/administrative actions are being considered for the identified legal highs.

This project will go through the following steps:

(1) Obtain a list of controlled substances from U.S. DEA.
(2) Convert their names into PubChem Compound IDs (CIDs).
(3) Find from PubChem their protein targets (primarily involved in the central nervous system) and binding affinities.
(4) For each controlled substance, run a similarity search to find structural analogues, which are likely to have the same biological function as controlled substances. Therefore, these are potential legal highs.
(5) Check the binding affinity of the potential legal highs against the proteins targeted by known controlled substances. If a compound has similar or stronger binding affinities for the protein target(s) than the current illegal drug, it strongly suggests that the compound may need to be added to the list of controlled substances, too.
(6) Discuss what compounds among the potential legal highs are currently considered for regulation by the authority.
(7) Write a policy memo about legal highs, based on what you learn from this practice.
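
Steps (4) and (5) can be sketched roughly as follows. This is only an illustration under stated assumptions: the Tanimoto coefficient on fingerprint bit sets is one commonly used similarity measure and is my choice here, the 0.8 threshold and the `fold_tolerance` parameter are arbitrary, Ki in nM is an assumed affinity measure, and all fingerprints, names, and values are made-up placeholders rather than real data.

```python
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient: shared 'on' bits divided by total 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_analogues(
    query_fp: Set[int],
    candidate_fps: Dict[str, Set[int]],
    threshold: float = 0.8,  # assumed cutoff, to be tuned in the project
) -> List[str]:
    """Step (4): keep candidates whose similarity to the query meets the cutoff."""
    return sorted(
        name
        for name, fp in candidate_fps.items()
        if tanimoto(query_fp, fp) >= threshold
    )


def binds_as_strongly(
    candidate_ki_nm: float,
    reference_ki_nm: float,
    fold_tolerance: float = 1.0,  # assumed: values above 1 allow "similar" affinity
) -> bool:
    """Step (5): a lower Ki means tighter binding to the target."""
    return candidate_ki_nm <= reference_ki_nm * fold_tolerance


if __name__ == "__main__":
    # Placeholder fingerprints: sets of 'on' bit positions.
    controlled_fp = {1, 2, 3, 4, 5}
    candidates = {"analogue_1": {1, 2, 3, 4, 5, 6}, "analogue_2": {1, 2, 9}}
    print(find_analogues(controlled_fp, candidates))
    # Placeholder Ki values in nM for one shared protein target.
    print(binds_as_strongly(candidate_ki_nm=20.0, reference_ki_nm=50.0))
```

In the actual project the fingerprints and affinities would come from PubChem rather than being typed in, and the cutoff would be one of the details to discuss.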

Note that all the minute details need to be discussed with students and other people during the course of this project.

Hello Everyone!!!

Hi,
My name is Meagan Turner and I am a Biology major at the University of Illinois Springfield. I am a student and I look forward to learning more about ChemInformatics. I hope this class will make me more well-rounded in using various databases. I plan on using the information covered in this course in my future professional career as well as in my current coursework.

Greetings from Leah at Cornell

Hello, everyone!

I am Leah McEwen, Chemistry Librarian at Cornell University. I manage several international projects related to chemical representation standards for many of the notations and file formats you will use in this course. Technology creation is exciting, and standards help by improving machine interpretation of scientific content and enabling smooth and accurate data exchange between systems. I contributed to the OLCC Chemical Representation Module 2 with Evan Hepler-Smith and am happy to follow up with any questions on this topic.

I hope everyone enjoys tinkering under the hood of chemical information!

Best,
Leah

Hello from Tuscaloosa, Alabama!

Hello,

My name is Vincent Scalfani. I am a Science and Engineering Librarian at The University of Alabama. After studying polymer chemistry in graduate school, I decided to become a Science librarian where I now teach information skills (and some basic informatics!) to chemistry and chemical engineering students.

I became interested in Cheminformatics a couple of years ago when using Jmol. Cheminformatics is so much fun and useful, particularly in libraries when we are thinking about advancing the discovery of information.

Now I mostly use Matlab for my cheminformatics projects. Matlab is a good beginner programming language and we have a lot of students at The University of Alabama using Matlab, so it gives me an opportunity to help students with their class projects too. I’m still very new to cheminformatics, but I have been able to accomplish a few neat projects like programmatically accessing chemical identifiers, creating music with InChIKeys, and finding names within InChIKeys.

I hope to learn more from the OLCC course and contribute with comments and perhaps some mentorship with a project if anyone is interested in Matlab.

Vin