Discussion

Ehren Bucholtz | Thu, 03/02/2017 - 10:43

The Excel implementation seems to be much faster. This also gets around the 50-link limit. I have an Excel sheet with over 100 molecules now that I am trying to convert from name to SMILES through OPSIN. Thank you for figuring this out.

One thing that I cannot figure out is how to put an image from the OPSIN or PubChem websites in the spreadsheet. This was really easy to do with the Google Sheet, but Excel doesn't seem to allow it. For example, I can put the following in my Google Sheet and it will display an image from OPSIN that corresponds to the name:

=IMAGE("http://opsin.ch.cam.ac.uk/opsin/"&B4&".png")

My simple workaround is to use the Google Sheet with the above image, and then copy and paste into the Excel spreadsheet. I have tried to find out whether this is possible in Excel, but so far it looks like this is only a function in Google Sheets.

Sunghwan Kim | Thu, 03/02/2017 - 09:35

I have a similar problem.  It seems that the number of web service calls that you can make from one Google Sheet is limited to 50.  Please see this document:

https://support.google.com/docs/answer/3093335?hl=en

I find this very discouraging because many informatics projects require hundreds or thousands of calls to import data from the web.

Microsoft Excel has a function called WEBSERVICE, which is similar to IMPORTDATA in Google Sheets.

=WEBSERVICE("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/cids/txt")

Or, if you need to use a cell reference (e.g., A4) in the URL:

=WEBSERVICE(CONCATENATE("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/",A4,"/cids/txt"))

But I also found that Excel has its own problems (e.g., it is not easy to share with other people).  So, which one is better depends heavily on the nature of your project.
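
For projects that need hundreds or thousands of lookups, the same PUG REST URL used in the WEBSERVICE formula above can also be called from a short script. What follows is only a minimal sketch in Python, assuming the plain-text response contains one CID per line. The names `cid_url` and `fetch_cids` and the pause between requests are illustrative choices, not part of PubChem or Excel.

```python
import time
import urllib.parse
import urllib.request

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name"


def cid_url(name):
    """Build the same URL that the WEBSERVICE formula builds from a cell."""
    # Percent-encode the name so spaces and other characters are URL-safe.
    return f"{BASE}/{urllib.parse.quote(name, safe='')}/cids/txt"


def fetch_cids(names, delay=0.5):
    """Return {name: [CID, ...]}; names that cannot be resolved map to []."""
    results = {}
    for name in names:
        try:
            with urllib.request.urlopen(cid_url(name), timeout=30) as resp:
                text = resp.read().decode("utf-8")
            results[name] = [int(line) for line in text.split()]
        except (OSError, ValueError):
            # OSError covers network and HTTP errors (e.g. name not found);
            # ValueError covers a response that is not a list of integers.
            results[name] = []
        time.sleep(delay)  # assumed pause so the server is not flooded
    return results
```

Calling `fetch_cids(["aspirin", "benzoic acid"])` would query each name in turn; the result can then be written to a file and shared more easily than a workbook.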

 

Sunghwan Kim | Thu, 03/02/2017 - 09:21

Well, I believe both OPSIN and OpenBabel give you canonical SMILES.  But these canonical SMILES may not be the same as canonical SMILES from Daylight or other programs.  (That is, all SMILES generation programs can generate what they call "canonical" SMILES, but because the canonicalization algorithm has been implemented differently by individual software developers, the resulting canonical SMILES are different, too.  They are all still called canonical SMILES.)

My impression is that you are taking the term "canonical" to mean something like "universally accepted by different SMILES developers".  However, "canonical" here actually means "canonical only for a particular software developer".  Overall, there is *no* guarantee that different SMILES generators will give you the same canonical SMILES for the same molecule.

Sunghwan Kim | Wed, 03/01/2017 - 20:44

Hi, all.

Actually, this is expected behavior of IMPORTDATA.  It imports data from a given URL in .csv (comma-separated values) or .tsv (tab-separated values) format.  (Please see this document) As we know, InChI contains many commas, so the IMPORTDATA function interprets it as an array of values separated by commas.  This is not a bug, but the expected behavior of the function.

Other line notations like SMILES and InChIKeys are not likely to have this issue because they don't use commas.  However, I do expect the same problem if you try to get systematic chemical names (which use commas a lot) through the IMPORTDATA function.  The same applies to any case in which the imported data contain commas.

The solution suggested by Jordi essentially changes the expected format from .csv to .tsv, so the commas are no longer interpreted as field separators.
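
The effect of the separator can be reproduced outside of Google Sheets. The sketch below uses Python's `csv` module as a stand-in for IMPORTDATA (an analogy only, not Google's implementation) and the morphine InChI quoted elsewhere in this thread; the helper name `split_fields` is made up for illustration.

```python
import csv

# Morphine InChI as quoted elsewhere in this thread.
inchi = ("InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)"
         "4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3"
         "/t10-,11+,13-,16-,17-/m0/s1")


def split_fields(line, delimiter):
    """Parse one line of delimited text and return its fields."""
    return next(csv.reader([line], delimiter=delimiter))


as_csv = split_fields(inchi, ",")   # comma as separator: InChI is broken up
as_tsv = split_fields(inchi, "\t")  # tab as separator: InChI stays whole

print(len(as_csv), "fields with a comma separator")
print(len(as_tsv), "field with a tab separator")
```

With a comma separator, every comma in the hydrogen and stereo layers starts a new field; with a tab separator, the string comes back as a single value.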

Sunghwan Kim | Wed, 03/01/2017 - 19:51

Hi, Phuc,

All SMILES strings mentioned in your comment are equivalent.  As covered in Section 2.3, many equivalent SMILES strings exist for the same molecule, and that is the reason why we have canonical SMILES.  However, because SMILES itself is proprietary and not an open project, even the canonical SMILES from one program is not the same as those generated by other programs.  (That is the reason why we have InChI.)

By the way, there is a quick way to check whether a SMILES string you write is correct or not.  Input that SMILES string into any molecule editor and see how the program interprets it.  For example, you can copy and paste your SMILES into PubChem Sketcher and hit the Enter key (https://pubchem.ncbi.nlm.nih.gov/edit2/index.html?cnt=0).  Then, you will see that your SMILES strings for benzoic acid are interpreted correctly.

Have a good night.

Sunghwan
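
A scripted alternative to pasting each string into a molecule editor is to ask a resolver for a structure-based key and compare the answers. The sketch below is a rough illustration, not a definitive method: it assumes the NCI/CADD Chemical Identifier Resolver URL pattern `/chemical/structure/<identifier>/stdinchikey`, and the helper names (`inchikey_url`, `fetch_text`, `same_molecule`) are hypothetical.

```python
import urllib.parse
import urllib.request

RESOLVER = "https://cactus.nci.nih.gov/chemical/structure"


def inchikey_url(smiles):
    """Build a resolver URL that asks for the Standard InChIKey of a SMILES."""
    # Characters such as '=', '(' and ')' must be percent-encoded in a URL path.
    return f"{RESOLVER}/{urllib.parse.quote(smiles, safe='')}/stdinchikey"


def fetch_text(url):
    """Download a URL and return its body as stripped text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8").strip()


def same_molecule(smiles_list, fetch=fetch_text):
    """Return True if every SMILES resolves to the same key."""
    keys = {fetch(inchikey_url(s)) for s in smiles_list}
    return len(keys) == 1
```

For example, `same_molecule(["c1ccccc1(C(=O)O)", "c1ccc(cc1)C(=O)O", "C1=CC=C(C=C1)C(=O)O"])` would send the three benzoic acid strings from this thread to the resolver. The `fetch` parameter exists so that a different service or a cached lookup can be swapped in.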

Robert Belford | Wed, 03/01/2017 - 19:17

Phuc, and all,

This is a very good question, which hits a problem with our education system and how we grade students, in that students become conditioned to questions for which there is one correct answer, and we grade them based on that. Although being able to answer those types of questions is important, there is a limit to the types of knowledge they can assess, and if you ask me, doing science requires more than answering questions for which the answer is known.

I will not answer your question, but I suggest you take these different SMILES strings to a resource like the NCI Chemical Identifier Resolver and answer your own question.  :-)

You have actually hit one of the big challenges with teaching a course like this, as in many ways we are trying to teach students how to do science.

Cheers,
Bob

 

Ehren Bucholtz | Wed, 03/01/2017 - 19:11

It appears that you are getting correct possible SMILES for benzoic acid. The problem with SMILES is that there are many flavors of SMILES. If I remember correctly, Daylight Chemical Information Systems defined SMILES first, but since it is proprietary, other versions were made that were different. By using different algorithms you can get different SMILES, and applying a Morgan algorithm can result in a more canonical form. It looks like the PubChem form is the canonical form, which is also the Daylight SMILES string.

If you put your generated SMILES into the Online SMILES Translator and Structure File Generator at https://cactus.nci.nih.gov/translate/ you can see that your SMILES and the ChemSpider SMILES resolve to the PubChem SMILES. The website appears to use the Daylight 1989 definition. As for having multiple SMILES, it all depends on which atom you choose as the first atom in the molecule, and you number from there. This is why InChI was developed: to have a non-proprietary format and algorithm as well as open source software.

I like that SMILES are much more readable for simple molecules, but when you get to a complex molecule like morphine, neither the SMILES nor the InChI is particularly human readable:

InChI=1S/C17H19NO3/c1-18-7-6-17-10-3-5-13(20)16(17)21-15-12(19)4-2-9(14(15)17)8-11(10)18/h2-5,10-11,13,16,19-20H,6-8H2,1H3/t10-,11+,13-,16-,17-/m0/s1

Canonical SMILES: CN1CCC23C4C1CC5=C2C(=C(C=C5)O)OC3C(C=C4)O

My take is that if neither is particularly human readable, it is better to have a non-proprietary standard like InChI.

Good luck on the exam!

olcc s16 | Wed, 03/01/2017 - 18:35

Hi,

I am studying for the exam tomorrow. Dr. Belford expects us to know how to generate a SMILES string, and I am doing some practice. I am trying to generate a SMILES for benzoic acid and came up with c1ccccc1(C(=O)O), but when I look up the SMILES at ChemSpider it gives me c1ccc(cc1)C(=O)O, and PubChem gives me C1=CC=C(C=C1)C(=O)O.

My questions are: is my generated SMILES equivalent to those from ChemSpider and PubChem? Also, since when naming a benzene with a substituent I can mark any carbon of the benzene as carbon 1 and start connecting onward, I can also come up with this SMILES for benzoic acid: c1cccc(C(=O)O)c1. Is what I thought correct?

Thanks,
Phuc

OLCC S53 | Wed, 03/01/2017 - 17:54

Yeah, we have our sheet pulling various naming protocols, from SMILES to InChIKey. InChI itself is the only one that comes back with an error, because for some reason Google Sheets reads the "," as a "tab" function. We were trying to see if anyone else knew anything about that or a way around it.

Jordi Cuadros | Wed, 03/01/2017 - 16:24

Hi Ehren,

As far as I know, this second argument is undocumented. It specifies the character used as the column separator, where "\t" means a tab character. Try using any other character instead.

This kind of second argument is common in similar functions in Google Sheets and other systems, e.g., the SPLIT function.

Cheers,

Jordi