Table of Contents
1. Teaching information literacy skills
3. Smart Sheet Video Tutorials
4. Biology Underlying Legal Highs
5. Exploring Chemistry Apps
6. We've got to do... something about CO2
7. Molecular Visualization on the Web
9. Exploring InChI Differences
11. WikiData Project
12. Optical Structure Recognition
13. PubChem and Reaxys: Complementary Databases?
14. Extending Chemical Structure Search Capability using R and Open Source Application Programming Interfaces (APIs)
Students and faculty need to subscribe to comments on projects they wish to be involved with. Once you log in, you can edit these pages like a wiki, upload files, subscribe to comments, and subscribe to updates. It is hoped that students across different campuses can use these project pages to collaborate with each other and with mentors.
By looking at the box on the right that says "Following Updates", you can see who is participating in the project discussions.
PLEASE CONTACT Bob Belford, firstname.lastname@example.org if you have any project suggestions or questions.
Potential Project Proposals: Please post proposals here, or email them to Bob Belford, email@example.com. Each Project will get its own homepage.
- US Government's Open Data Projects - US Government's Open Data, https://www.data.gov/, could provide a lot of interesting projects. See comment below by Dr. Kim.
- eChemistry with Google Sheets and R or Python: Develop a library of eScience functions for chemists using Google Sheets and then use these to solve a chemical problem. Students will pick specific problems to solve that are related to their research and future interests. Contact Dr. Belford.
- PubChem, Chemical Inventories and Chemical Safety: The issues of chemical safety are ubiquitous, and this project seeks to find ways to connect resources within PubChem to a chemical inventory, with an emphasis on safety. Contact Dr. Belford.
- Linked Histogram and Data Visualization: Data validation is one of the primary challenges for digital data literacy. Different databases can give different values for the same quantity. A histogram could be generated from a Google Sheet that extracts data from different databases and shows the spread of the values, with each bar linked to the original source. Contact Dr. Belford.
- MATLAB Project?
- Optical Structure Recognition
- ChemWiki/LibreText cheminformatics Educational Tool
- Any Ideas??? Please contact Dr. Belford, firstname.lastname@example.org, and we will add them to this list.
Student Project Pages (the following will be links to actual collaborative project pages). These pages can function as a wiki-style collaborative workspace where students can discuss projects, upload files, and collaborate across the internet. Students must self-subscribe and be logged in to see these.
Potential Project: Biology underlying legal highs
This potential project aims to:
(1) provide an overview of biology underlying controlled substances (i.e., illegal drugs),
(2) identify potential "legal highs" that have not been regulated by the authority (e.g., U.S. Drug Enforcement Administration) based on PubChem's bioactivity data and structural similarity to known controlled substances, and
(3) review what legislative/administrative actions are being considered for the identified legal highs.
This project will go through the following steps:
(1) Obtain a list of controlled substances from U.S. DEA.
(2) Convert their names into PubChem Compound IDs (CIDs).
(3) Find from PubChem their protein targets (primarily involved in the central nervous system) and binding affinities.
(4) For each controlled substance, run a similarity search to find structural analogues, which are likely to have the same biological function as controlled substances. Therefore, these are potential legal highs.
(5) Check the binding affinity of the potential legal highs against the proteins targeted by known controlled substances. If a compound has similar or stronger binding affinities for the protein target(s) than the current illegal drug, this suggests that the compound may need to be added to the list of controlled substances, too.
(6) Discuss what compounds among the potential legal highs are currently considered for regulation by the authority.
(7) Write a policy memo about legal highs, based on what you learn from this practice.
Note that all the fine details need to be discussed with students and other people during the course of this project.
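As a starting point for steps (2) and (4), here is a minimal sketch of how the lookups could be phrased as PubChem PUG REST requests. It only builds the request URLs; it does not send them or parse the replies. The function names, the default threshold of 90, and the default limit of 100 records are placeholders chosen for illustration, not values fixed by the project.

```python
from urllib.parse import quote, urlencode

# Base address of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(name):
    """Step (2): URL that asks PubChem for the CIDs registered under a chemical name."""
    # quote() percent-encodes spaces and other characters that are unsafe in a URL path.
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/cids/JSON"


def similarity_url(cid, threshold=90, max_records=100):
    """Step (4): URL for a 2-D similarity search around one CID.

    threshold is the minimum Tanimoto similarity in percent; both defaults are
    illustrative assumptions that the project team would need to tune.
    """
    options = urlencode({"Threshold": threshold, "MaxRecords": max_records})
    return f"{PUG_REST}/compound/fastsimilarity_2d/cid/{cid}/cids/JSON?{options}"


if __name__ == "__main__":
    # Print the request URLs; fetching them is left to the reader's HTTP client of choice.
    print(name_to_cids_url("aspirin"))
    print(similarity_url(2244, threshold=95, max_records=10))
```

The URLs can be pasted into a browser to inspect the JSON replies before any further code is written. Bioactivity retrieval for steps (3) and (5) is not covered by this sketch.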
I would like to participate
I would like to participate in this project.
Please email me for further discussion.
Hi, there. Would you please send me an email (cc'ing the instructor at your school) to discuss this potential project further? My email is kimsungh at ncbi.nlm.nih.gov
We want one more student for this project
The student who showed interest in this project was actually a faculty member at a participating school, who wants to learn cheminformatics. Both she and I want young students to have an opportunity to get some hands-on experiences from this project. So, we want to have one more person (either undergraduate or graduate student) that is willing to work with us (as a team of three). Please let us know if you are interested in this project or if you have questions about it. Thanks.
Interested in Project
I would be interested in working on this project. I am a student at SDSU.
Hi, I am the other faculty
Hi, I am the other faculty member working on this project. Would you mind sending me an email at kedan.he at centre.edu?
I will shortly set up a project page for this project, and all interested participants can communicate by subscribing to it. I will try and get that up by tomorrow.
Interested in legal highs project
I am also interested in the biology of legal highs project. I am a student @ UALR.
Would you mind sending me an email?
Would you mind sending me an email at kedan.he at centre.edu?
Project: chemical name-structure association clean-up algorithm
During this course, you will encounter many cases where a chemical name does not match its structure in a chemical database. For example, some chemicals whose names contain the string "sodium" do not have any Na atoms, and some chemicals with "acetate" in their names do not have an acetate unit in their structures. This may sound strange, but such cases do exist in many chemical databases, and we will discuss this topic frequently during the course.
The proposed project is about developing a dictionary-based algorithm that identifies potentially incorrect chemical name-structure associations. This algorithm will consist of the following steps.
(1) Generate a list of common chemical fragments' (or units') names and structures (in SMILES).
(2) For each chemical fragment (let's take "fumarate" as an example), repeat these:
(2a) Retrieve all compounds whose name contains the string "fumarate". Because the retrieved compounds have "fumarate" in their names, they are expected to have "fumarate" unit in their structures, too.
(2b) Run a substructure search against the compounds retrieved from (2a), using the SMILES string for "fumarate" as a query. The resulting chemicals have the "fumarate" string in their name and the "fumarate" unit in their structure, so the name-structure associations are considered to be correct.
(2c) Take the difference between the results of (2a) and (2b). The difference corresponds to the structures whose names have the "fumarate" string, but which don't have "fumarate" structure unit. Therefore, the name-structure associations in these compounds are potentially incorrect.
(2d) Analyze the name-structure associations for chemicals from (2c). This step is intended to identify exceptional cases (e.g., compounds whose names contain a string like "calcium-free" are not likely to have calcium in their structures, although the names contain the string "calcium").
This project is pretty straightforward and needs only a little programming skill (which I think one can learn within a week or two), although making a chemical fragment name-structure list would be somewhat tedious. However, it tackles a very important issue in cheminformatics, so I expect the resulting algorithm to be very practical.
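Steps (2a) through (2d) boil down to a set difference followed by a filter. The sketch below illustrates that logic only: it assumes the name search (2a) and the substructure search (2b) have already been run elsewhere and returned sets of compound IDs. The function name, the `exception_markers` argument, and the compound IDs and names in the example are all made up for illustration.

```python
def flag_mismatches(name_hits, substructure_hits, names, exception_markers=()):
    """Split name-only hits into suspect and excused compound IDs.

    name_hits:          IDs whose name contains the fragment string   (step 2a)
    substructure_hits:  IDs whose structure contains the fragment     (step 2b)
    names:              dict mapping each ID to its chemical name
    exception_markers:  substrings such as "-free" that explain a mismatch (step 2d)
    """
    # Step (2c): names that mention the fragment but structures that lack it.
    suspects = set(name_hits) - set(substructure_hits)

    flagged, excused = [], []
    for cid in sorted(suspects):
        name = names[cid].lower()
        if any(marker in name for marker in exception_markers):
            excused.append(cid)   # e.g. "calcium-free": the mismatch is expected
        else:
            flagged.append(cid)   # potentially incorrect name-structure association
    return flagged, excused


if __name__ == "__main__":
    # Hypothetical IDs and names, not real database records.
    names = {
        1: "Compound A fumarate",
        2: "Compound B fumarate",
        3: "fumarate-free buffer X",
    }
    flagged, excused = flag_mismatches({1, 2, 3}, {1}, names, exception_markers=("-free",))
    print("needs review:", flagged)
    print("explained by an exception rule:", excused)
```

Keeping the excused compounds in a separate list, rather than discarding them, makes it easier to check whether an exception rule is hiding real errors.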
3-D Molecular Similarity assessment for European Orphan Drugs
If a drug gets a marketing authorization in Europe with orphan designation (meaning that it is approved for rare diseases), it will get market exclusivity for 10 years (meaning that no "similar" drug for the same indication can enter the market). Please see this document for more details:
Therefore, the European Medicines Agency (EMA), which is responsible for marketing authorization of medical products, requires the applicant of a new drug to submit a "similarity report", which compares the new drug with existing drugs in terms of molecular structure, mechanism of action, and indication. For more details, see Section 2.1 of this document:
While the molecular structure similarity comparison is required for drug approval, molecular similarity is a very subjective concept, and there is no standard way to evaluate it.
For this reason, some papers have analyzed molecular similarity among approved drugs using several 2-D similarity methods:
However, these studies evaluated molecular similarity using 2-D similarity methods only. If 3-D similarity methods are used, we may gain different insights on similarity assessment for EMA's orphan designations. This study will take the following steps:
(1) Get all approved orphan drugs from the European Medicines Agency
(2) Retrieve all known drugs from a public database (e.g., PubChem, DrugBank)
(3) Generate 3-D conformers for the drugs in (1) and (2)
(4) Compute 3-D similarity scores between the drugs, using the 3-D conformers generated in (3) and several 3-D similarity methods.
(5) Compute 2-D similarity scores between the drugs, using commonly used 2-D fingerprint methods.
(6) Identify drug-drug pairs with a low 2-D score but with a high 3-D score (meaning that the two drugs are similar in 3-D but not in 2-D).
(7) Identify drug-drug pairs with a high 2-D score but with a low 3-D score [that is, opposite to (6)].
(8) Discuss the difference between 2-D and 3-D similarity in recognizing molecular similarity.
(9) Discuss potential impacts of using 3-D similarity methods for EMA's similarity assessment for marketing authorization.
(10) Discuss how EMA's and FDA's regulations are different in terms of orphan drug marketing approval.
This project is quite straightforward, but would take more time than other projects, because 3-D similarity comparison takes longer than 2-D similarity comparison.
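To make steps (5) through (7) concrete, here is a minimal sketch. It assumes each 2-D fingerprint is available as a set of "on" bit positions and scores a pair with the Tanimoto coefficient, a commonly used 2-D measure (the project may well choose others). The 3-D scores are assumed to come from whatever 3-D method is picked in step (4). The cut-offs `low` and `high` and the drug labels are placeholders, and 2-D and 3-D scores are not necessarily on comparable scales, so real thresholds would have to be set per method.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    # Two empty fingerprints share nothing, so return 0.0 instead of dividing by zero.
    return len(a & b) / len(union) if union else 0.0


def discordant_pairs(scores_2d, scores_3d, low=0.4, high=0.8):
    """Find drug pairs on which the 2-D and 3-D methods disagree.

    scores_2d, scores_3d: dicts mapping a (drug_a, drug_b) tuple to a similarity score.
    Returns (pairs similar in 3-D only, pairs similar in 2-D only), i.e. steps (6) and (7).
    """
    similar_3d_only, similar_2d_only = [], []
    # Only pairs scored by both methods can be compared.
    for pair in sorted(scores_2d.keys() & scores_3d.keys()):
        s2, s3 = scores_2d[pair], scores_3d[pair]
        if s2 <= low and s3 >= high:
            similar_3d_only.append(pair)
        elif s2 >= high and s3 <= low:
            similar_2d_only.append(pair)
    return similar_3d_only, similar_2d_only


if __name__ == "__main__":
    # Made-up scores for three hypothetical drugs A, B, and C.
    s2d = {("A", "B"): 0.3, ("A", "C"): 0.9, ("B", "C"): 0.6}
    s3d = {("A", "B"): 0.85, ("A", "C"): 0.2, ("B", "C"): 0.7}
    print(discordant_pairs(s2d, s3d))
```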
Visit data.gov to find some interesting data for projects.
If you are not sure what to do for your projects or if you want to find one on your own, please visit the government's open data site (https://www.data.gov) to see what kind of information is publicly available on the web.
This site has data sets generated by the federal, state, and local governments, many of which are related to chemicals in one way or another, in many different areas (including food and drug safety, environmental health, atmospheric science, and so on). You can search this site using a simple text query like "chemicals", "drugs", "pesticides", and other chemistry-related keywords. I hope that you can find some interesting data set or sets that you want to analyze for your projects. Even if you don't find any data set for your projects, you will realize that taxpayers' money has been used to generate an enormous amount of data that reflects various aspects of our lives, and that many people are not aware that these data exist in the public domain.
By the way, I think these two sites may interest you too.
I hope you find them interesting.
Automated identification of potential multi-target ligands
I have recently written a paper about how to use PubChem to identify potential multi-target ligands (small molecules that simultaneously bind multiple protein targets) for subsequent in silico or in vitro experiments. (The paper hasn't been published yet, so I will share it with those who show interest in this project.) While the protocol described in this paper uses PubChem's web-based tools and interfaces only, it can be implemented in a computer program using PubChem's programmatic access. So the proposed project is to write a program that identifies potential multi-target ligands from PubChem and downloads them to the user's computer. Ideally, we would also develop a web tool associated with this program. Please let me know if you are interested in this project.
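The protocol in the paper is not reproduced here, so the following is only a rough sketch of the core idea as stated above: a multi-target ligand is a compound that is active against more than one protein target. It assumes the lists of active compound IDs per target have already been downloaded. The function name, the `min_targets` argument, and the target labels and IDs in the example are invented for illustration.

```python
from collections import defaultdict


def multi_target_ligands(actives_by_target, min_targets=2):
    """Return compounds reported active against at least min_targets targets.

    actives_by_target: dict mapping a target label to the set of compound IDs
                       found active against that target.
    Result: dict mapping each qualifying compound ID to its sorted list of targets.
    """
    # Invert the mapping: for each compound, collect the targets it hits.
    targets_by_compound = defaultdict(set)
    for target, compounds in actives_by_target.items():
        for cid in compounds:
            targets_by_compound[cid].add(target)

    return {
        cid: sorted(targets)
        for cid, targets in targets_by_compound.items()
        if len(targets) >= min_targets
    }


if __name__ == "__main__":
    # Hypothetical targets T1..T3 and made-up compound IDs.
    actives = {"T1": {10, 11, 12}, "T2": {11, 12}, "T3": {12, 13}}
    for cid, targets in sorted(multi_target_ligands(actives).items()):
        print(cid, targets)
```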
Substances with specific properties: search & visualize results
Project proposal from Damon Ridley:
Arguably chemistry is more about the properties of substances than the substances themselves. Sure, we know that ibuprofen has 13 carbons, 18 hydrogens and a couple of oxygens and we know silver is an element, but the fact that ibuprofen is a great anti-inflammatory agent and silver is the most electrically conductive element really gets us chemists interested.
Most databases focus on bibliographic and/or substance/reaction records, and few focus on properties. Further, the property databases that do exist often cover only a few properties, and few of them allow us to perform a precise search for a property and then obtain the substances with that property.
So in this Project we are going to focus on properties – and the properties that interest you! We can approach this in various ways. One way would be to discover the main databases with property information, and then learn the tricks of the trade in searching for properties. Another way would be to pick a property and find out about the substances that have these properties.
Once we go down this path we’ll find a bunch of specific facts about specific substances, and, if our interest is in visualization of results then we could think of ways to present our data – or even make our own webpages.
So, here’s the deal. You define your project and you define its scope. The only condition is that it must focus on properties.
Sound tough? Could be! But be brave and give it a go. We’ll help you …
We've got to do... something about CO2
Project proposal from Damon Ridley:
Hey, this video got me thinking:
What are these guys doing? What have they made? How can I find out? – well, if I cannot find their actual science, then what chemistries are involved with treating carbon dioxide as a “waste management problem”?
So, how do I search for this? Where? Things that interest me may be adsorption, or is it absorption? How about chemisorption? Can I find sorption diagrams, or enthalpy of adsorption? Desorption? Anything about electrochemical fixation of carbon dioxide?
Let’s try to answer some of these questions. Quickly!
How do you build and optimize a strategy to find information?
At Elsevier, our searches in Reaxys are influenced by our knowledge of the content, structure and search options of our product. It is clear to us, however, that Reaxys users may approach finding information in Reaxys differently. So, we want to learn from you — users of Reaxys. How would you teach others to leverage the search capabilities of Reaxys? If you take on one of the following projects (or have an idea for another similar project), our hope is that you will explore the content and functions of Reaxys and develop some best practices or “tips and tricks” that help other users to take full advantage of what Reaxys has to offer.
OPERATORS AND TRUNCATION IN SEARCHES
Most of us know the Boolean operators AND, NOT, OR. In a search, these operators offer different ways of linking together query terms and specifying how the terms relate to the hits that result. In the same way, NEAR, NEXT and PROXIMITY help refine the input criteria for a search. Truncation also serves to optimize a search strategy, opening the possibility of finding a broader range of information connected by a steadfast commonality, such as the stem of a word. Another form of truncation is entering ? or * in a formula to be searched in Reaxys.
(1) How does Reaxys interpret operators and truncations?
(2) What is the impact of these different operators on the outcome of a search?
(3) How exactly does a hit set change depending on what operators or truncation are used?
(4) What rules can one follow in their use?
(5) How are operators implemented in other search engines/databases?
Prepare a set of screencasts to explain the role of operators and truncation in search strategies.
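One way to think about questions (2) and (3) is to model each query term as the set of records it retrieves, so that AND, OR, and NOT become set intersection, union, and difference. The toy model below is only that: it is not how Reaxys interprets operators, which is exactly what question (1) asks you to find out. The index contents and record numbers are made up, and the wildcard behavior (`*` for any run of characters, `?` for exactly one) is borrowed from Python's `fnmatch` module as an assumption.

```python
from fnmatch import fnmatchcase


def lookup(index, pattern):
    """Return the record IDs of every indexed term that matches the pattern.

    index:   dict mapping an indexed term to the set of record IDs containing it.
    pattern: a term that may contain * (any run of characters) or ? (one character).
    """
    hits = set()
    for term, records in index.items():
        if fnmatchcase(term, pattern):
            hits |= records
    return hits


if __name__ == "__main__":
    # A made-up index of five terms over five records.
    index = {
        "adsorption": {1, 2},
        "absorption": {2, 3},
        "desorption": {4},
        "chemisorption": {5},
        "electrochemical": {3, 5},
    }
    any_sorption = lookup(index, "*sorption")       # left truncation
    ad_or_ab = lookup(index, "a?sorption")          # single-character wildcard
    electro = lookup(index, "electrochemical")

    print("AND:", sorted(any_sorption & electro))   # records matching both
    print("OR: ", sorted(ad_or_ab | electro))       # records matching either
    print("NOT:", sorted(ad_or_ab - electro))       # first set minus the second
```

Comparing the sizes of these hit sets shows how a hit set grows with OR and truncation and shrinks with AND and NOT.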
BUILDING AN EFFECTIVE SEARCH STRATEGY
We are all used to the ease with which we enter a phrase into Google and get relevant answers to our question. At the same time, we also know that the long list of hits that emerges includes a lot of irrelevant results and we rarely go beyond the first couple of hit pages. A search for scientific information can be quite complicated. Ideally, we build a search to retrieve only relevant hits, but at the same time ensure that the answers we get are comprehensive.
A search strategy is our approach to finding answers to a question. In a natural language search engine, that approach may be figuring out the best way to phrase our question. In a user interface like Query builder in Reaxys, it involves figuring out what type of information we are looking for, what terms we should query, and how we will connect the search fields used. Another aspect of a search strategy may be any processing we do with the results of a search -- like combining hit sets from two or more searches, or filtering or analyzing hit sets.
Defining a search strategy can be difficult. It is influenced by the type of question asked, the type of search engine or system used, and the knowledge context of our question -- a search for a particular reaction can be approached in different ways depending on what we know about the reaction itself.
So how do you build a search strategy? What steps do you follow? How do you inform your approach, and where do you find the right query terms to use?
Pick a question and show how you can use Reaxys to optimize your search strategy to answer the question. Just to narrow the scope of the project, focus on a specific type of search:
(1) search for information on a particular chemistry topic
(2) search for a substance or group of substances that meet certain criteria
(3) search for properties of substances that would help you identify an unknown
(4) search for a reaction and figure out how to optimize it
Use screencasts to show your thought process and the steps in generating optimal sets of answers. Based on your exploration, build guidelines you can share with others for building effective search strategies.
Teaching information literacy -- a skill for life
An area that we have been exploring at Elsevier and where we can learn a lot from users is teaching information literacy — at any level of education. We would like to hear from those of you who are thinking about novel ways to engage students in thinking about scientific information and the importance of knowing how to navigate, evaluate and use that information. You may be just beginning your training towards becoming an instructor or you may be a veteran in education with a lot of experience in managing the “roadblocks” to teaching fundamental information literacy skills. Better yet, both could work together on this project.
The Association of College and Research Libraries defines information literacy as "a set of abilities requiring individuals to recognize when information is needed and have the ability to locate, evaluate, and use effectively the needed information." These skills are platform-agnostic, so they must be generalized, independent of where information is searched and in which discipline it is used. So what is the skill set that an individual must learn? How can these skills be taught within the context of how people already search, evaluate, and use information? And how can available information management systems be used in the classroom setting to teach and foster these skills? You could:
(1) pick an information management system available to you (Reaxys, PubChem, SciFinder, Google) and develop lesson plans using that platform to give students hands-on experiences that teach the skills you consider important.
(2) develop the list of skills you consider important. Then pick two or more information management systems and compare and contrast how those skills lead to optimal information searching, evaluating and using in each platform.
To this end, allow us to bring to your attention a chapter written by Librarian Judith Currano where she discusses her experiences with and approach to teaching information literacy. You will find the document uploaded to this page. Another resource that might be of use is a webinar Currano did with Damon Ridley. The link to the recorded webinar is: https://attendee.gotowebinar.com/recording/9122161258764560642
chapter from Judith Currano?
I share your interest in teaching information literacy, so I was hoping to read the chapter by Judith Currano. But I don't see a file uploaded to the page on Drupal. Maybe something happened to the attachment?
My apologies for the delay in my response. I uploaded the chapter again and this time made sure it appears. Please do share any thoughts you have!
Potential Project: Exploring Chemistry Apps
An increasing number of chemistry apps are appearing on tablets and mobile phones. They can be used to view a molecule in 3D, run molecular simulations, manage literature, serve as interactive learning tools, and act as lab references and tools. In this project, a participant can pick one of the six categories of chemistry apps (reference/study guide, molecular viewer, research, utilities, periodic table, games), find the best app(s), and critique them. The participant can find a scenario in a particular chemistry course, a lab environment (for a specific experiment), or a research project where chemistry apps can be used extensively and effectively. One JChemEd review paper, a book chapter, and a website with a chemistry apps review table will be provided. The student is encouraged to contribute to a wiki-like website to review the latest chemistry apps, update information, and correct any errors on the wiki site.
Project Ideas- Optical Structure Recognition
This is a project that I have been working on, and I thought I could incorporate students into it. Students in this course have varying levels of skills and abilities, so depending on their skill levels, students might take on different aspects of this project if interested.
The project is based on the OSRA program developed by Igor Filippov at NCI.
Igor's original goal was to create software that could read structures from published PDFs and extract the image data as SMILES strings.
You can test it out here:
You just upload a structure that you have drawn in a chemical drawing program and saved as a gif or jpg.
I have contacted Igor, and in addition to some of my ideas, he has some great ideas that would be perfect for student projects.
I saw this as a way to remove the tedium of hand grading papers. I thought it would be great to have my 200 students answer a quiz question by drawing structures and submitting their papers. I could have the software scan them and see if they are correct. There seems to be evidence that handwriting improves learning. Students also draw structures when they do homework, but for large classes, we often test with multiple choice. Eventually, I would like to see whether students have better learning outcomes if, for example, we stick to drawing throughout.
I have been testing this out with a subset of structures, and am finding out the issues with handwriting recognition within the program.
One of the key aspects for this part of the project would be to develop a validation set of molecules for testing.
The following is a link to a Google Sheet that has what I have in mind for this aspect of the project:
The project would be to set up a list of alkenes, branched alkenes, cycloalkenes, alkynes, branched alkynes, alcohols, branched alcohols, alkyl halides (bromides, chlorides, fluorides), ketones, carboxylic acid derivatives (acid chlorides, esters, anhydrides, thioesters, amides), amines, and thiols. You can see from the Google Sheet above that stereochemistry can and should be included. One aspect would be to use InChIKey data to determine whether a correct structure was drawn. For example, 4-methyloctane can be racemic, R, or S, but the first 14 letters of the InChIKey don't change. So, in the Python grading program that I have written, one can award partial credit for drawing the correct connectivity but incorrect stereochemistry. I know that OSRA can handle wedges and dashes, but I am unsure how well, especially when they are handwritten.
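The partial-credit idea can be sketched in a few lines. This is not the grading program mentioned above, just a minimal illustration of the comparison it describes: an exact InChIKey match earns full credit, and a match on only the first block (the 14 letters before the first hyphen, which encode connectivity) earns partial credit. The function name, the 0.5 partial-credit weight, and the keys in the example are assumptions; the keys are placeholders in InChIKey layout, not keys of real compounds.

```python
def grade_structure(student_key, answer_key, partial_credit=0.5):
    """Score a drawn structure by comparing its InChIKey with the answer key.

    Returns 1.0 for an exact match, partial_credit when only the first
    14-letter block (connectivity) matches, and 0.0 otherwise.
    """
    if student_key == answer_key:
        return 1.0
    # The first hyphen-separated block is the 14-letter connectivity hash.
    if student_key.split("-")[0] == answer_key.split("-")[0]:
        return partial_credit   # right skeleton, wrong or missing stereochemistry
    return 0.0


if __name__ == "__main__":
    # Placeholder keys in InChIKey layout (14-10-1 letters); not real compounds.
    answer = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
    print(grade_structure("AAAAAAAAAAAAAA-BBBBBBBBSA-N", answer))  # same structure
    print(grade_structure("AAAAAAAAAAAAAA-CCCCCCCCSA-N", answer))  # same connectivity only
    print(grade_structure("DDDDDDDDDDDDDD-BBBBBBBBSA-N", answer))  # different connectivity
```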
I have written a program in Python that, after a student paper is scanned in, checks each structure and compares its SMILES string to a known SMILES string. I think using InChI might be better for this, as indicated in part 1a. For this part of the project, we would draw the molecules in the validation set by hand, scan them in, and determine the error rate for SMILES determination. Factors that I have found complicate this step include eraser marks, the quality of the lines drawn, and sloppy drawing. In this part of the project we would look at what makes a good structure for analysis (whether hand drawn or computer drawn).
OSRA has an algorithm for dealing with bicyclic or bridged compounds. The question is how good it is at handling them. I know that morphine is in the original validation data set, but when I have tried determining its SMILES from an image, it seems to get it wrong. This part of the project would be to use PubChem or Reaxys to identify bicyclic/bridged molecules (such as adamantane) and determine the success rate.
Just like compounds, reactions can be identified by SMILES (reaction SMILES, or rsmi). You can see an example and explanation of reaction SMILES here:
In the last few versions of OSRA (1.4.0 and higher) this capability has been added, but when I spoke with Igor last week, he indicated that this aspect of the program has not been fully vetted.
Igor has suggested we put together a collection of a few hundred images from literature and/or patents. We would have to make a database of the images with starting material, reagents, and products. Then we would need to determine what the accepted rsmi should be for the reaction, and then compare the OSRA output to the known.
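The comparison step could start from something like the sketch below. It assumes the usual reaction SMILES layout of reactants>agents>products, with the molecules in each part separated by dots, and compares the OSRA output against the accepted string part by part, ignoring the order of molecules. It compares SMILES as plain text, which is a simplification: the same molecule can be written as different SMILES strings, so a real test would first canonicalize each molecule with a cheminformatics toolkit. The function names are made up for illustration.

```python
def parse_rsmi(rsmi):
    """Split a reaction SMILES into unordered sets of reactants, agents, and products."""
    parts = rsmi.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected reactants>agents>products, got {rsmi!r}")
    roles = ("reactants", "agents", "products")
    # An empty part (e.g. no agents) becomes an empty set.
    return {role: frozenset(p.split(".")) if p else frozenset()
            for role, p in zip(roles, parts)}


def compare_rsmi(osra_rsmi, accepted_rsmi):
    """Report, for each role, whether the OSRA output matches the accepted reaction.

    Plain string comparison only; canonicalization is deliberately left out.
    """
    found, expected = parse_rsmi(osra_rsmi), parse_rsmi(accepted_rsmi)
    return {role: found[role] == expected[role] for role in expected}


if __name__ == "__main__":
    # Acid-catalyzed esterification of ethanol with acetic acid, written two ways.
    accepted = "CCO.CC(=O)O>[H+]>CC(=O)OCC.O"
    reordered = "CC(=O)O.CCO>[H+]>O.CC(=O)OCC"
    print(compare_rsmi(reordered, accepted))
```

Reporting each role separately shows whether OSRA misread a starting material, a reagent, or a product, which should help when tallying error rates over a few hundred images.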
Polymer support was also added to OSRA in the latest version, 2.1.0. Igor indicated that this aspect of the program has not been fully vetted either. Just like in project 3 above, we would have to put together a database of polymers and test whether OSRA properly determines the output. He suggested that we look at the latest version of InChI, 1.05. I am guessing we could find out what they used as a validation set and use that as a starting point for OSRA.
OSRA is not the only open source optical structure recognition program available. Imago (https://github.com/ggasoftware/imago) is also free. There is also a commercial program called CLiDE (http://www.keymodule.co.uk/products/clide/index.html). It might be interesting to compare how OSRA does against these other packages.
Long story short, there are lots of projects that can be based around optical structure recognition. I also believe that some of these projects could result in publication, if any students are looking to make posters for ACS meetings, or even a paper in J. Chem. Inf. Model. on the validation of OSRA, or in J. Chem. Educ. on the handwriting analysis.
I just put together a video demonstration of my program.
There was a small error in the program, and I may have sounded a little confused after seeing that it found 3 alkenes. If you look closely at the original document, it had a cycloalkane that didn't get converted. The issue was that I extracted the same molecule from the page twice. I wasn't trying to hide a mistake. When I went back and re-ran the program with the correct extraction pixel settings, it got the correct SMILES/structure for 1-ethyl-2-methylpentane as well.
Potential project: PubChem and Reaxys: complementary databases?
Project proposed by Damon Ridley, Sunghwan Kim and Anja Brunner.
The wealth of information available creates information overload problems, and more than ever it is helpful if scientists understand the core collections in their field and know which sources to use and when. At some stage, evaluations of the different options need to be made, and since elsewhere in this course PubChem and Reaxys have been introduced, we now have the opportunity to evaluate them together.
We shall explore answers to questions such as: in what ways are they similar and in what ways are they complementary; what is their combined landscape, and how are the different systems searched?
Those interested in teaching chemical information retrieval may wish to explore the “big picture”, i.e., the overall content and search functionality of PubChem/Reaxys; those interested in finding information in their special field of study or research may wish to explore subject-specific information.
Evaluating systems is fraught with numerous difficulties pertaining not only to the database(s) but also to the knowledge and skills of the searchers. In this project we shall work as a team, with different participants evaluating areas of their choice. The outcome should be of interest to everyone in the OLCC program … and beyond …