Discussion | DivCHED CCCE: Cheminformatics OLCC

Truncation

Amita, OK. Here is a summary. From the Landing Page, there is a text box and when you hover over it you see it changes to Search Reaxys. In Search Reaxys you may enter either: a. just natural language terms, or b. terms with truncation. (Yes, there are other things you can enter and the outcomes may be different, but right now let's keep it simple and focus on a. or b. above.) When you go down route a., Reaxys interprets the query as best it can by applying hundreds (yes, hundreds) of algorithms to try to work out what the user wants and then provides a list of options from which to choose. These algorithms include, inter alia, building queries in Substance, Reaction and/or Document Records and include automatic addition of 'synonyms'. However, when you go down route b. all these algorithms ARE TURNED OFF and Reaxys searches only in Document Records - and applies truncation or whatever else you have done, including use of the semi-colon (for Boolean OR). So, in Search Reaxys: nanomaterial you have just used 'natural language' and route a. kicks in. You see the first two entries say: Documents Journal. What is happening here is that one of the algorithms tracks to titles of journals, and there is indeed a Journal called Nanomaterials. So if you View Results (for the 452) you will find hits in the Journal. The last entry says 205,760 Documents Titles, Abstract, Keywords: nanomaterial. This tells you that the search has been done in Titles, Abstract, Keywords in Document Records and you can go and View Results. From these results you can work out what Reaxys searched, but there is an even easier way to find this out, namely: hover (with the mouse) over Preview Results for these 205,760 Documents and you see Edit Query come up. Click this and Reaxys tells you exactly what was searched: specifically, the search was done in the Document Basic Index (that is, single words in Titles, Abstract, Keywords) and you can see the bunch of 'synonyms' that Reaxys automatically applied. 
Note a semi-colon is between them = OR. (Note: unlike every other natural language search engine, which applies algorithms and doesn't tell you what it searched, Reaxys is fully transparent, shows you what was searched and then enables you to learn from that and modify it if you want.) However, with either of Search Reaxys: nanomaterial* or (separately) *nanomaterial or (separately) *nanomaterial* route b. kicks in, and the outcomes are either right, left, or right and left truncation respectively. (Note that Edit Query in these cases has yet to be fully implemented and what you see if you go down this path has 'errors'. Don't worry. Edit Query only came out a month ago and details like this are being attended to, but in reality the correct truncation as stated above has been used/searched.) I hope you can follow. It's all logical - once you know the (simple) rules. NEVERTHELESS what I really wanted was for you to learn from everything you did above and then think your way (eventually) to the following search. DOCUMENT BASIC INDEX: *NANO* *WASTE* AND DOCUMENT BASIC INDEX: *WOOD*; *CELLULOS* (You need to do this through Query builder, then Search properties: document basic, then drag the querylet twice into the main working space, and enter the terms above - although you don't have to use capitals.) It is easiest to build two querylets (basically you want nano AND waste, AND (wood OR cellulos) ... and it is easiest to do these separately). The first one defaults to a search *nano* AND *waste* (meaning you search nano, nanocluster, nanostructure, nanocarbon, nanomaterial, bionanomaterial, and all their plurals and so forth, AND waste, waste water, wastewater and so forth). The second one truncates both right and left and searches the terms with OR. When you do this you get over 1,200 documents and you have pretty much covered all variations. Interestingly you get NANOCELLULOSE(s) and this could well be of interest. I am not sure. 
It seems to me that this is a super answer set that you could modify further if you wanted (e.g., if you really wanted the water bit, then you would search *water* separately and combine this search with the 1,200+ documents...if you need help then let me know). Searching the literature is an ART and a SCIENCE. We have to think - and not like a Googler. We want to do comprehensive/precise searches in quite complex scientific databases (with all types of author word variations), and we don't want to "stop at the first few pages of answers". I hope this helps. Damon. PS: In your answer to your assignment, or in your presentation, please understand that you will need to communicate to people who may be, say, Google, Web of Science, Scopus, or SciFinder users. They may not be aware of how Reaxys works, so please take your time and explain everything step by step. In the long run, and having used all these products, I think that the Reaxys way is very neat. You have only to learn a couple of things and then you think your way through the problem logically. That's true science, right?
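To make the truncation and Boolean logic above concrete, here is a toy sketch in Python. It is not how Reaxys is implemented; it only mimics left/right truncation with shell-style wildcards (`fnmatch`) over a few made-up words and titles, and all function names are hypothetical. It does show why *nanomaterial* can never return fewer hits than nanomaterial*: the double-truncated pattern matches everything the right-truncated one does, plus words with a prefix.

```python
from fnmatch import fnmatchcase

def matches(pattern, word):
    """Case-insensitive wildcard match; '*' stands for any run of characters."""
    return fnmatchcase(word.lower(), pattern.lower())

def doc_matches_any(patterns, words):
    """True if any word in the document matches any pattern (Boolean OR)."""
    return any(matches(p, w) for p in patterns for w in words)

def search(docs, and_groups):
    """Each entry of and_groups is an OR-group; a document must satisfy all groups (AND)."""
    return [doc_id for doc_id, text in docs.items()
            if all(doc_matches_any(group, text.split()) for group in and_groups)]

# Made-up index words, for illustration only.
words = ["nanomaterial", "nanomaterials", "bionanomaterial"]
right = [w for w in words if matches("nanomaterial*", w)]
left = [w for w in words if matches("*nanomaterial", w)]
both = [w for w in words if matches("*nanomaterial*", w)]

# Made-up document titles, for illustration only.
docs = {
    1: "nanocellulose from wastewater sludge",
    2: "nanomaterials in wood coatings",
    3: "bionanomaterial recovery from wood waste",
    4: "cellulose waste recycling",
}
# *nano* AND *waste* AND (*wood* OR *cellulos*)
hits = search(docs, [["*nano*"], ["*waste*"], ["*wood*", "*cellulos*"]])
print(right, left, both, hits)
```

In this toy set, document 2 drops out because it has no "waste" term and document 4 because it has no "nano" term, which is the same narrowing the two querylets perform.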

Thank you. I got it.

You can't provide multiple chemical names as input.

It is possible to provide multiple CIDs as an input (in the form of a comma-separated list) like this:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/887,702,1031,263/property/MolecularFormula,MolecularWeight,CanonicalSMILES,HeavyAtomCount,XLOGP/CSV

However, you can't do this with chemical name inputs, because many chemical names contain commas (e.g., 1,2-dichloroethane), and the server cannot tell whether a comma is used as a separator or as part of the name. When chemical names are used as an input, you need to process one name at a time.
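As a rough sketch of the difference, the Python below builds one request URL for a list of CIDs but one URL per chemical name. The helper names are hypothetical, and percent-encoding the name with `urllib.parse.quote` is my assumption about a safe way to put a name into a URL path; the code only constructs the URLs and does not send any requests.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def cid_property_url(cids, properties, fmt="CSV"):
    """One URL for many CIDs: the CIDs are joined with commas."""
    cid_part = ",".join(str(c) for c in cids)
    prop_part = ",".join(properties)
    return f"{BASE}/compound/cid/{cid_part}/property/{prop_part}/{fmt}"

def name_property_urls(names, properties, fmt="CSV"):
    """One URL per name: a name may itself contain commas, so names are never joined."""
    prop_part = ",".join(properties)
    return [
        # safe="" percent-encodes every reserved character, including ',' and '/'
        f"{BASE}/compound/name/{quote(name, safe='')}/property/{prop_part}/{fmt}"
        for name in names
    ]

props = ["MolecularFormula", "MolecularWeight"]
cid_url = cid_property_url([887, 702, 1031, 263], props)
name_urls = name_property_urls(["1,2-dichloroethane", "ethanol"], props)
print(cid_url)
for url in name_urls:
    print(url)
```

Each URL in `name_urls` would then be fetched separately (ideally with a short pause between requests) and the results combined afterwards.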

Because you had a typo in your query.

It should be "chloro" not "chloor".

https://www.ncbi.nlm.nih.gov/pccompound/?term="(3-chloro-2-hydroxypropyl)trimethylammonium+chloride"

Google Sheets

My students and I have been working on the Google Sheets assignment focused on section 3.3 of this Module. I have tried several ways to pull the CID information into the C column in the spreadsheet shown in the image below. First we tried importxml with cell B3 exporting the CIDs as XML, following the example spreadsheet, using =importxml(B3, "//*[local-name()='cid']"). Then I switched B3 to export as TXT and tried importdata. Neither of these gives me CID values in column C. I'm guessing there is something simple I am missing.
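One thing that may be worth checking is the exact element name in the returned XML, because XPath name tests such as local-name()='cid' are case-sensitive. The Python sketch below reproduces the same "match on local name, ignore namespace" idea outside the spreadsheet. The sample XML and its namespace are hypothetical stand-ins, not a copy of what the URL in B3 returns, so whether this is the actual cause here is only a guess.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real element names and namespace may differ.
SAMPLE_XML = """<IdentifierList xmlns="http://example.org/hypothetical-namespace">
  <CID>887</CID>
  <CID>702</CID>
</IdentifierList>"""

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]

def values_by_local_name(xml_text, name):
    """Return the text of every element whose local name equals `name` exactly."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if local_name(el.tag) == name]

upper = values_by_local_name(SAMPLE_XML, "CID")
lower = values_by_local_name(SAMPLE_XML, "cid")
print(upper, lower)
```

With this sample, the lowercase test finds nothing while the uppercase one finds both values, so opening the B3 URL in a browser and copying the element name exactly as it appears is a cheap first check.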

Comment File:

Google Sheets Morsch.png

Search compound in PubChem

When I search (3-chloor-2-hydroxypropyl)trimethylammonium chloride, I do not get any results, but when I enter Quat-188 or the CAS number, I get a result. Why do we not get any results from a compound search with a chemical name? Thank you

searching documents on Reaxys

I tried to search for some documents related to my project. When I searched by just entering nanomaterial, I got 452 records for nanomaterials and 205,760 records for nanomaterial. When I used nanomaterial* it gave me 153,536 records, and with *nanomaterial* I got 153,651 records. I am confused about using truncation. Why do I get a greater number of records when I put "*" on both sides of the text? Thank you

Experimental Data

I realise this discussion has come up under a module: "Programmatic Access to Public Chemical Databases" and since Reaxys is not a free database, what is possible in Reaxys is probably irrelevant. Nevertheless I offer a couple of general comments: 1. Creating databases for experimental properties is very time-consuming and hence, since someone has to do the work, it can be quite expensive; 2. Dr Kim in his comments raises questions such as: a. some physical properties are dependent on conditions of measurement (boiling point/pressure); b. physical properties are reported in different units and the values need to be normalised, and c. different states of matter (and certainly different purities of substances) can affect the data (so often not only the data but also notes on the data need to be included). In short, to create experimental property databases you not only need massive resources, but you also need a number of rules/guidelines relating to what to excerpt and how to organise (then search) the data. The other thing to add here is that very often the substance itself is of much lesser interest than what it does (i.e., its properties). Yes, cholesterol has 27 Cs, 46 Hs and an O and we can draw its structure or even visualise it in 3D (kind of - since who really knows how the side chain is coiled in different environments), but it is what it does (i.e., its properties) that really gets us scientists excited. Recently I questioned one of the people at Reaxys responsible for the excerption and organisation of physical data and he informed me that in the vast majority of instances the original data in documents needed correction, normalisation or qualification in some way or other before the data was entered in the database. While he had never done a complete analysis of the situation, he suspected that data in well over 95% of the original documents needed to be re-evaluated in one way or other. 
Since Reaxys has well over 400 fields of data (of which approximately half involve numeric data and half have text data) and since the extent of experimental 'property data' in Reaxys is (I estimate) around 100 times larger than any other product, you can get a sense of the magnitude of the undertaking (and of the added value) that Reaxys makes (and provides). So, yes, in Reaxys you can pull spreadsheets of experimental data out and compute the data with in-house software. Further the data pulled out can go well beyond just substances (or molecular weights) and properties such as melting/boiling points. As a couple of examples you can easily find critical superconducting temperatures of lanthanide-containing substances, or IC50 data values relating to studies of substances for pharma targets or diseases. Cheers, Damon

Only computed properties are indexed.

Currently, you can "search" PubChem by computed properties only; searching by experimental properties is not supported. To have such functionality in PubChem, there are a couple of issues that should be addressed beforehand. One of the issues is that the experimental properties collected from various sources often do not contain the necessary metadata (for example, the atmospheric pressure at which the boiling/melting temperature was measured). This makes it difficult to standardize the experimental values. The units employed should also be standardized (e.g., Kelvin, deg-C, or deg-F). In addition, many experimental data are not really structure-specific (e.g., there are many different substances that can be represented as SiO2, but with very different properties). In theory, these issues can be addressed in some way, but we don't really have enough resources to work on this.
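To give a sense of the unit-standardization step mentioned above, here is a minimal Python sketch that converts a temperature reported in Kelvin, deg-C, or deg-F to Kelvin. The function name and the accepted unit spellings are my own assumptions; real experimental records would also need the measurement pressure and other metadata, which this sketch does not attempt to handle.

```python
def to_kelvin(value, unit):
    """Convert a temperature to Kelvin; `unit` spellings accepted here are illustrative."""
    u = unit.strip().lower()
    if u in ("k", "kelvin"):
        return float(value)
    if u in ("c", "deg-c", "°c"):
        return float(value) + 273.15
    if u in ("f", "deg-f", "°f"):
        return (float(value) - 32.0) * 5.0 / 9.0 + 273.15
    # Refuse to guess: an unknown unit is exactly the missing-metadata problem.
    raise ValueError(f"unrecognized temperature unit: {unit!r}")

# The same boiling point of water at 1 atm, reported three ways.
reported = [(373.15, "K"), (100, "deg-C"), (212, "deg-F")]
normalized = [to_kelvin(v, u) for v, u in reported]
print(normalized)
```

Values that arrive without a recognizable unit raise an error instead of being silently converted, which mirrors why records lacking metadata are hard to standardize.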

Sort of going the other way

It seems like most of the examples we have done are based on the options available in the Entrez system. In advanced search you can do a Boolean "and" and, if you had the option, upload a list of chemicals and find the ones that have a molar mass within a given range. But can you do that for a different property, like the melting point? (I know we can find who submitted melting points to PubChem.) So the question is, can a spreadsheet pull back things like melting points or boiling points from PubChem? If so, we may be able to do a calculation in the spreadsheet that filters out things that are outside of a given range. Cheers, Bob
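The range filter described here can be sketched in a few lines of Python, using a computed property (molecular weight) since that is what can be retrieved in bulk. The CSV text is a hand-typed sample in what I assume is the general shape of a property table downloaded as CSV, with rounded values that should be checked against live output; the function name is hypothetical.

```python
import csv
import io

# Hand-typed sample rows for illustration; verify against a real download.
SAMPLE_CSV = (
    '"CID","MolecularFormula","MolecularWeight"\n'
    '887,"CH4O","32.04"\n'
    '702,"C2H6O","46.07"\n'
    '263,"C4H10O","74.12"\n'
)

def filter_by_range(csv_text, column, low, high):
    """Keep the rows whose numeric `column` value lies within [low, high]."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if low <= float(row[column]) <= high]

kept = filter_by_range(SAMPLE_CSV, "MolecularWeight", 40.0, 80.0)
kept_cids = [row["CID"] for row in kept]
print(kept_cids)
```

The same comparison could be written as a spreadsheet formula on the imported column; the logic is just a lower and an upper bound test on each row.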