In doing question 4 part C, I moved the end of the dendrogram from 0.9 to 1.0, and I am attaching the dendrogram to this comment. I was surprised to see so many ones. I then looked at the first pair, 71120 and 188803, and realized, as I should have expected, that they were isomers. So does that mean that CIDs 122130020 through 18644062 are all isomers, and that compound 9865442 has around 95% similarity to them?
In part a of question 2, I believe I am following the directions word for word, but when I start the search I get "History Not Found" as a result and the tab says error encountered. I don't know if there is something wrong in the question, or if there is something I am missing. I get the same "History Not Found" message whether I do 95%, 90%, 85%, or 80%. Anyone else have this issue?
There are two perspectives on the use of Gaussian functions to describe the shape of molecules.
(1) We often use a hard-sphere model to visualize a molecule (in which a hard sphere is placed at the position of each atom), because it gives you a clear idea of where each atom is located in the molecule and how big it is. Taken together, the positions of these hard spheres represent the shape of the molecule. However, this is a simplified view because an atom is *not* a hard sphere. Rather, the atom is "soft", in the sense that the electron density of an atom decays with distance from the nucleus but never reaches zero, forming a very long tail. Using Gaussian functions instead of hard spheres to describe molecular shape reflects this view to some degree, although it is true that the shape will be blurred. But that is the nature of electrons, too.
(2) There is a more practical reason to use Gaussian shapes over hard spheres. When you superimpose hard-sphere models of molecules to compute the overlap between the two shapes, it is not easy to deal with the "cusp" where two hard spheres meet. However, when Gaussian functions are used, you can calculate the overlap between them very easily and quickly, because Gaussian functions have the convenient property that the product of two Gaussian functions is another Gaussian.
So, overall, it is true that the use of Gaussian functions will blur the molecular shapes, but it is actually close to the "soft" nature of atoms. But, the primary advantage (or reason) for using Gaussian functions is to speed up the computation of molecular overlap.
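To make the product property mentioned above concrete, here is a minimal sketch in Python. It is only an illustration, not the code used by ROCS or PubChem: each "molecule" is reduced to a single unit-height spherical Gaussian, the exponents are made-up values, and the function names are hypothetical. The shape Tanimoto is written in its usual overlap-volume form, V_AB / (V_AA + V_BB - V_AB).

```python
import math

def gaussian_product_1d(a, xa, b, xb):
    """Return (K, c, xc) such that
    exp(-a*(x-xa)**2) * exp(-b*(x-xb)**2) == K * exp(-c*(x-xc)**2)."""
    c = a + b                                   # exponents add
    xc = (a * xa + b * xb) / c                  # new center is a weighted mean
    K = math.exp(-a * b / c * (xa - xb) ** 2)   # prefactor shrinks with distance
    return K, c, xc

def overlap_3d(a, ra, b, rb):
    """Closed-form overlap integral of two unit-height spherical Gaussians
    exp(-a*|r-ra|**2) and exp(-b*|r-rb|**2) over all of 3D space."""
    d2 = sum((p - q) ** 2 for p, q in zip(ra, rb))
    c = a + b
    return math.exp(-a * b / c * d2) * (math.pi / c) ** 1.5

def shape_tanimoto(v_ab, v_aa, v_bb):
    """Shape Tanimoto from overlap volumes."""
    return v_ab / (v_aa + v_bb - v_ab)

# Assumed (illustrative) exponent and positions for two single-"atom" shapes.
alpha = 0.8
origin, shifted = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
v_aa = overlap_3d(alpha, origin, alpha, origin)
v_bb = overlap_3d(alpha, shifted, alpha, shifted)
v_ab = overlap_3d(alpha, origin, alpha, shifted)
print(shape_tanimoto(v_ab, v_aa, v_bb))
```

Because the overlap has a closed form, no numerical integration over a grid is needed, which is where the speed-up comes from. For real molecules the overlap would be accumulated over all pairs of atoms, which this sketch leaves out.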
Nwume
I am not quite sure whether the background to your question is to check whether you should take the Student Project or whether you already have signed up and want to start.
If the former, then you will find the type of information you get from PubChem and from Reaxys is quite different, so it would be good for you to explore options in both databases. As you have learnt from other sections of the course, different databases organise (and excerpt) data in different ways and you really need to understand a bit about each database in order to extract the best results (for your needs).
If the latter, then I think the first thing to do is to list a couple of the mental illnesses you wish to study. Since you have already attended the course on PubChem, you probably know how to proceed, although Sunghwan will certainly help if you need it. As for Reaxys, as soon as you give me a list of illnesses we can work together on a plan for how to proceed there.
Damon
I made a comment about this on hypothes.is, but figured I would ask the question here as well. When applying what is basically a Gaussian cloud around a molecule, similar to how photo editors apply a Gaussian blur, I'm not quite sure I understand how this helps to reveal structural information when comparing molecules for similarity. It seems that doing so would make structures that are unlikely to be similar in shape look more similar than they are. I understand that the module also said the math involved was beyond its scope, and that is probably the key to understanding this; I'm just trying to visualize the overall approach of this method. I did go to the ROCS website and look around, but it still seems a little confusing.
Andrew
What you have between the InChIKey and the next CID is a line-feed character. This character is not rendered in some text editors (e.g. Windows Notepad) but it is in others (e.g. WordPad). In Unix/Linux, it marks the start of a new line, and this is how Excel interprets it. In Windows, a newline is indicated with two characters, carriage-return + line-feed, and this is why a lone line-feed doesn't show in Notepad. More detailed information can be found at https://en.wikipedia.org/wiki/Newline. Cheers, Jordi
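A small Python sketch of the two newline conventions described above. The records are made-up placeholders (they are not real CIDs or InChIKeys), and the layout of one CID, a tab, and one InChIKey per line is an assumption about the downloaded file.

```python
# Hypothetical two-record export: CID <tab> InChIKey, one record per line.
unix_text = "111\tAAAAAAAAAAAAAA-BBBBBBBBBB-N\n222\tCCCCCCCCCCCCCC-DDDDDDDDDD-N\n"

# The same content with Windows line endings (carriage-return + line-feed).
windows_text = unix_text.replace("\n", "\r\n")

def split_records(text):
    """Split on any newline convention, then split each line on tabs."""
    return [line.split("\t") for line in text.splitlines()]

# repr() makes the otherwise invisible control characters visible.
print(repr(unix_text))
print(repr(windows_text))
print(split_records(unix_text) == split_records(windows_text))
```

Printing with repr() is a quick way to see whether a file uses "\n" or "\r\n" when an editor shows everything on a single line.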
"[@data-serp-pos='0']" means that the XML element has an attribute named data-serp-pos whose value is '0'. I don't know what the attribute means, but it seems to be related to the position of the link in the search results. Thus a value of '0' would mean the first result. Cheers, Jordi
Hi Jordi,
Thanks for sharing this. I have two questions, and please pardon my not having time to look up everything ab initio, but I did spend multiple hours on this today. First, on the code,
=iferror(importXML(B2, "//a[@data-serp-pos='0']"),"InChi Key Not in Wikipedia")
can you explain what is going on, especially the [@data-serp-pos='0'] part?
And second, we need to do this with Excel, as Google Docs said there was too much data. I will try and attach to this email the file I am working on, which I downloaded from PubChem, and contains chemicals with LCSS information.
I am also curious about one thing: if you look at the raw text in the file, each InChIKey ends with "N" and is followed directly by the PubChem number, as if they were one string. But when you open the file as tab-delimited in Excel, it recognizes that there is a break there. Why? Does anyone know?
(Hopefully there will be a file attached to this comment)
Cheers,
Bob
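For anyone who wants to try the attribute predicate from the formula above outside of a spreadsheet, here is a minimal sketch using Python's standard library. The markup is invented for illustration; it is not what the real search page returns, and the example.org links are placeholders. Note that ElementTree needs the relative form ".//a" where importXML uses "//a".

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a search results page (not the real markup).
page = """<results>
  <a data-serp-pos="0" href="https://example.org/first">First hit</a>
  <a data-serp-pos="1" href="https://example.org/second">Second hit</a>
  <a href="https://example.org/other">No position attribute</a>
</results>"""

root = ET.fromstring(page)

# Select only the <a> elements whose data-serp-pos attribute equals '0'.
first = root.findall(".//a[@data-serp-pos='0']")

for link in first:
    print(link.text, link.get("href"))
```

The predicate in square brackets filters the matched elements, so links with a different position, or with no such attribute at all, are skipped.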
Hi everyone, my name is Nwume Chinenye from UALR. I was directed by my instructor to talk to you about my project. I am working on different drugs that can treat mental illness, using both Reaxys and PubChem. Can you please help me? I don't know where to start.
A simplified version following the approach Bob just suggested is available at https://docs.google.com/spreadsheets/d/1CJwl2Uz5ZQDsy0V7I0khi5IJM2mhMdMbLycaIrnIJyU/copy. I also changed the URL to search only for the connectivity layers of the InChI. This way more chemicals are found, although stereochemical and isotopic information is not preserved. Cheers, Jordi
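As a rough illustration of what searching on connectivity only can mean, the sketch below assumes the lookup is done with the InChIKey, whose first 14-character block is a hash of the connectivity and whose second block covers the remaining layers such as stereochemistry and isotopes. The keys shown are made-up placeholders and the function names are hypothetical; this is not taken from the spreadsheet itself.

```python
def connectivity_block(inchikey):
    """Return the first (14-character) block of an InChIKey."""
    parts = inchikey.strip().split("-")
    if len(parts) != 3 or len(parts[0]) != 14:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return parts[0]

def same_skeleton(key_a, key_b):
    """True when two InChIKeys share the connectivity block."""
    return connectivity_block(key_a) == connectivity_block(key_b)

# Made-up keys: same first block, different second block.
key_a = "AAAAAAAAAAAAAA-UHFFFAOYSA-N"
key_b = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"

print(connectivity_block(key_a))
print(same_skeleton(key_a, key_b))
```

Matching on the first block alone treats stereoisomers and isotopically labeled forms as the same compound, which is why more hits are found and why that information is lost.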