Discussion | DivCHED CCCE: Cheminformatics OLCC

Questions 3 and 4

On questions 3 and 4, when searching the database for compound 5090, I am getting the exact same result for all of the searches. It takes me directly to the page for the compound Rofecoxib. I wanted to make sure I wasn't doing anything wrong.

Solubility database

For my class project, I'm looking for a database with specific solubility data. Does anyone know of any such databases?

Fingerprints and identifying functional groups

Let me follow up on this student's question. One of the suggested projects was to figure out how to determine the functional groups within a molecule from its InChI. My understanding is that the InChI canonicalization algorithm does not use functional groups, so this is not directly possible, but that an InChI could be converted to a mol file, or some other format, from which it could be done. This led to the question of whether a fingerprint could be used to identify functional groups, and if so, whether you could make a fingerprint out of any form of chemical identifier (InChI, SMILES, ...) and know the functional groups. Instead of using a Tanimoto coefficient with the fingerprint to determine similarity, could you simply get a yes/no answer, and a count? Could you make some sort of comparison matrix, where you compare the identifier to a series of fake molecules, where each fake molecule represents a specific functional group? If you get a 1, it is there; if you get a 0, it is not. (I am not clear on how you would count them.) Cheers, Bob

Answers to Student's questions that disappeared.

It seems that somebody posted questions to this page, but somehow they disappeared (probably due to some technical issue). I’m re-posting the questions with my answers.

>>>> (1) How do you access a molecule's fingerprint from PubChem? I can do similarity searches, but I cannot find specific molecular fingerprints.

You can find it in a 2-D structure file downloaded from the molecule’s summary page. Visit this page (for CID 2244, aspirin) and click the download button in the top-left corner. You will see several download options in the menu. Click the “Display” button for the SDF file format under the “2D Structure” section. It will display the relevant data in the web browser. The fingerprint appears under the PUBCHEM_CACTVS_SUBSKEYS tag:

> "PUBCHEM_CACTVS_SUBSKEYS" ! The brackets have been changed to quotes to avoid conflicts with HTML.
AAADccBwOAAAAAAAAAAAAAAAAAAAAAAAAAAwAAAAAAAAAAABAAAAGgAACAAADASAmAAyDoAABgCIAiDSCAACCAAkIAAIiAEGCMgMJzaENRqCe2Cl4BEIuYeIyCCOAAAAAAAIAAAAAAAAABAAAAAAAAAAAA

Note that the fingerprint in the file is stored as Base64-encoded text, meaning that you will need to decode it to a binary format, as explained in the PubChem fingerprint description file (<a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt">ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt</a>).

==== Beginning of excerpt from the file ====
When exporting fingerprint information in the SD file format, the SD tag for the PubChem Substructure Fingerprint property is "PUBCHEM_CACTVS_SUBGRAPHKEYS". The PubChem Substructure Fingerprint is Base64 encoded to provide a textual representation of the binary data.
For a description of the Base64 encoding and decoding algorithm specification, go to: <a href="http://www.faqs.org/rfcs/rfc3548.html">http://www.faqs.org/rfcs/rfc3548.html</a>
==== End of excerpt from the file ====

Note that the excerpt names the tag PUBCHEM_CACTVS_SUBGRAPHKEYS, whereas the SDF file shown above uses PUBCHEM_CACTVS_SUBSKEYS.

You can download these data in different file formats, from individual compound summary pages or from the download facility (<a href="https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi">https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi</a>). You can also download them from the PubChem FTP site (<a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem">ftp://ftp.ncbi.nlm.nih.gov/pubchem</a>). It is also possible to get them programmatically. For example, you can use a PUG-REST request like: <a href="https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/record/SDF/?record_type=2d&response_type=display">https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/record/SDF/?record_type=2d&response_type=display</a> More detailed information on PUG-REST can be found in this paper (<a href="http://nar.oxfordjournals.org/content/43/W1/W605.long">http://nar.oxfordjournals.org/content/43/W1/W605.long</a>). Please see Figure 2 of this paper to get an idea of how to formulate PUG-REST requests programmatically.

>>>> (2) Is it possible to identify functional groups for algorithmically created chemical identifiers by transforming them to molecular fingerprints (and analyzing the fingerprints instead of going through mol files)?

It is not clear exactly what you are asking, but my guess is that you are asking whether you can figure out if a compound has a particular functional group by checking the compound’s fingerprint. In theory, it is possible, if your fingerprint definition includes a fragment corresponding to the functional group of interest. In practice, however, you can do the same thing simply by searching the database using a SMARTS string as a query.
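As a rough illustration of the decoding step in answer (1), the following Python sketch pulls a tag value out of SDF text and turns the Base64 string into a list of bits. It assumes that the decoded bytes begin with a four-byte big-endian bit count and that the bits are stored most-significant-bit first; please check both assumptions against the fingerprint description file. The function names are made up for this example, and the sample string is only the first 12 characters of the aspirin fingerprint above, so it yields just the first 40 bits.

```python
import base64

def extract_sd_tag(sdf_text, tag):
    """Return the first data line under '> <tag>' in SDF text, or None."""
    lines = sdf_text.splitlines()
    header = "<" + tag + ">"
    for i, line in enumerate(lines):
        if line.startswith(">") and header in line and i + 1 < len(lines):
            return lines[i + 1].strip()
    return None

def decode_pubchem_fingerprint(b64_text):
    """Decode a Base64 fingerprint string into a list of 0/1 bits."""
    # Restore any '=' padding that was lost when the string was copied.
    padded = b64_text + "=" * (-len(b64_text) % 4)
    raw = base64.b64decode(padded)
    # Assumption: the first four bytes hold the number of bits (big-endian).
    n_bits = int.from_bytes(raw[:4], "big")
    bits = []
    for byte in raw[4:]:
        # Assumption: bit 0 of the fingerprint is the highest bit of the byte.
        for shift in range(7, -1, -1):
            bits.append((byte >> shift) & 1)
    return bits[:n_bits]

# Truncated sample: only the first 12 characters of the aspirin fingerprint.
sample_sdf = "> <PUBCHEM_CACTVS_SUBSKEYS>\nAAADccBwOAAA\n\n$$$$\n"
bits = decode_pubchem_fingerprint(
    extract_sd_tag(sample_sdf, "PUBCHEM_CACTVS_SUBSKEYS")
)
print(bits[:4])  # [1, 1, 0, 0]
```

Once the fingerprint is a list of bits, checking for a fragment is a yes/no lookup at the bit position that the definition file assigns to that fragment.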

Recent papers about several chemistry databases

Oxford University Press publishes a journal called “Nucleic Acids Research”. Every January, this journal publishes a special “Database” issue that covers a wide variety of databases useful to its readers. It also maintains and updates a list of important databases so that people can see what databases are available to the scientific community. Please visit this database collection, available at: <a href="http://nar.oxfordjournals.org/content/39/suppl_1/D1/suppl/DC1">http://nar.oxfordjournals.org/content/39/suppl_1/D1/suppl/DC1</a>.

Several Database Issue papers that cover the databases mentioned in this class have been published in this journal as advance articles. They are:
<a href="https://doi.org/10.1093/nar/gkv951">https://doi.org/10.1093/nar/gkv951</a> (PubChem)
<a href="https://doi.org/10.1093/nar/gkv1031">https://doi.org/10.1093/nar/gkv1031</a> (ChEBI)
<a href="https://doi.org/10.1093/nar/gkv1047">https://doi.org/10.1093/nar/gkv1047</a> (PDBe)

The following papers also deal with popular chemistry databases. I suggest that you skim the titles and abstracts of these papers to see what kind of information is available to the public.
<a href="https://doi.org/10.1093/nar/gkv1072">https://doi.org/10.1093/nar/gkv1072</a> (BindingDB)
<a href="https://doi.org/10.1093/nar/gkv1075">https://doi.org/10.1093/nar/gkv1075</a> (SIDER)
<a href="https://doi.org/10.1093/nar/gkv1037">https://doi.org/10.1093/nar/gkv1037</a> (The guide to pharmacology)

The fragment counts can be encoded into the fingerprints.

Although there may be several approaches, I will explain what is used in PubChem. In PubChem’s 2-D similarity method, the count of a particular fragment is encoded into the fingerprint definition. As an example, please check the PubChem fingerprint definition available at: <a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt">ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt</a>

The first four bit positions in this definition are:

===
Section 1: Hierarchic Element Counts - These bits test for the presence or count of individual chemical atoms represented by their atomic symbol.

Bit Position    Bit Substructure
0               >= 4 H
1               >= 8 H
2               >= 16 H
3               >= 32 H
===

According to this definition, the first four bits of a compound with 40 hydrogen atoms will be “1111”, and those of a compound with 10 hydrogen atoms will be “1100”. In this way, you can distinguish compounds that contain different numbers of the same fragment.

It should be noted that you may choose a different fingerprint definition. For example, rather than using >= 4, >= 8, >= 16, >= 32 hydrogens for the first four bits, you may choose >= 8, >= 16, >= 24, >= 32, >= 40 hydrogens for the first five bits (note that there is one additional bit for >= 40 hydrogens). Of course, you can choose *not* to make any distinction between different hydrogen atom counts. With that said, different groups have devised different fingerprint definitions to meet their own needs, and the PubChem fingerprint is only one of them.

To answer your original question about the number of benzene ring fragments, please check the fragment definitions for bit positions 178-212 (in the PubChem fingerprint definition file at the link above).

179    >= 1 saturated or aromatic carbon-only ring size 6
...
186    >= 2 saturated or aromatic carbon-only ring size 6
...
193    >= 3 saturated or aromatic carbon-only ring size 6
...
200    >= 4 saturated or aromatic carbon-only ring size 6
...
207    >= 5 saturated or aromatic carbon-only ring size 6

According to this definition, compounds with four benzene rings will have a bit sequence of “1...1...1...1...0” and compounds with one benzene ring will have a bit sequence of “1...0...0...0...0”.
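The count thresholds described above can be sketched in a few lines of Python. This is only an illustration: the helper names are made up, and the Tanimoto value is computed over just the five ring-count bits quoted above rather than over a full PubChem fingerprint, so it is not the score PubChem would report.

```python
def threshold_bits(count, thresholds):
    """One bit per threshold: 1 if count >= threshold, else 0."""
    return [1 if count >= t else 0 for t in thresholds]

def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length bit lists."""
    both = sum(x & y for x, y in zip(a, b))    # bits set in both
    either = sum(x | y for x, y in zip(a, b))  # bits set in at least one
    # Convention chosen here: two all-zero bit lists count as identical.
    return both / either if either else 1.0

HYDROGEN_THRESHOLDS = (4, 8, 16, 32)  # bit positions 0-3
RING_THRESHOLDS = (1, 2, 3, 4, 5)     # bit positions 179, 186, 193, 200, 207

print(threshold_bits(40, HYDROGEN_THRESHOLDS))  # [1, 1, 1, 1]
print(threshold_bits(10, HYDROGEN_THRESHOLDS))  # [1, 1, 0, 0]

four_rings = threshold_bits(4, RING_THRESHOLDS)  # [1, 1, 1, 1, 0]
one_ring = threshold_bits(1, RING_THRESHOLDS)    # [1, 0, 0, 0, 0]
print(tanimoto(four_rings, one_ring))            # 0.25
```

The four-ring compound sets three bits that the one-ring compound lacks, so on these bits alone the two share only one of four set bits. In that sense, multiple copies of a fragment do lower the similarity.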

Multiple Fragments

Dr. Kim, Does the Tanimoto coefficient take into account multiple copies of a particular fragment? What happens if you compare a molecule that has four benzene rings with a molecule that has only one benzene ring?

Programmatic access to PubChem

Many databases provide programmatic access routes. For example, PubChem provides several programmatic methods, including (1) E-Utilities, (2) Power User Gateway (PUG), (3) PUG-SOAP, and (4) PUG-REST. Among these methods, PUG-REST is the simplest and easiest to learn. PUG-REST encodes into a one-line Uniform Resource Locator (URL) all the information necessary for accessing a particular piece of information in PubChem. For example, the following URL retrieves data for Compound IDs (CIDs) 2244 and 1983 from PubChem in XML format: <a href="http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244,1983/record/XML">http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244,1983/record/XML</a> Introductory material on programmatic access to PubChem is available at the following link: <a href="http://nar.oxfordjournals.org/content/43/W1/W605">http://nar.oxfordjournals.org/content/43/W1/W605</a>

Alternatively, you can download a set of records that you are interested in, using the PubChem Structure Download and Assay Download tools. <a href="https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi">https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi</a> (for structure download) <a href="https://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi">https://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi</a> (for assay data download) The structure download tool allows users to download up to 500,000 compounds at a time. To download more than this number, you can split your compound list into pieces of at most 500,000 compounds each. However, if you have a very long list of compounds to download, you should bulk download from the PubChem FTP site (<a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/">ftp://ftp.ncbi.nlm.nih.gov/pubchem/</a>).
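A minimal Python sketch of the two ideas above: building a PUG-REST record URL from a list of CIDs, and splitting a long CID list into pieces no larger than the download limit. The function names are hypothetical, only the URL pattern shown above is used, and the fetch helper is defined but not called because it needs network access.

```python
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def compound_record_url(cids, output="XML"):
    """Build a PUG-REST URL that requests full records for the given CIDs."""
    cid_list = ",".join(str(int(cid)) for cid in cids)
    return f"{PUG_REST_BASE}/compound/cid/{cid_list}/record/{quote(output)}"

def chunked(items, size=500_000):
    """Yield consecutive slices of items, each with at most size entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fetch(url, timeout=30):
    """Download the response body as text (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")

print(compound_record_url([2244, 1983]))
# https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244,1983/record/XML
```

Calling fetch() on the printed URL should return the same XML that the link above shows in a browser.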

Discussion on APIs will

Discussion on APIs will arrive in Module 8. We'll get there.

APIs

I see you talk about ways that they allow downloading the data in bulk, but do any of them offer APIs so that you don't need to download bulk data?