2.5.1 Two-dimensional (2-D) similarity methods


Molecular similarity methods can be broadly classified into two-dimensional (2-D) and three-dimensional (3-D) similarity methods. Typically, 2-D similarity methods use so-called molecular fingerprints, which encode structural information of a molecule into a binary string (that is, a string of 0’s and 1’s). The position of each number in this string corresponds to a particular fragment. If the molecule has a particular fragment, the corresponding bit position is set to 1, and otherwise to 0. Note that there are many different ways to design molecular fingerprints, depending on what fragments are included in the fingerprint definition. PubChem uses its own fingerprint called PubChem subgraph fingerprints.

In 2-D similarity methods, structural similarity between two molecules is estimated by comparing their molecular fingerprints. The similarity is quantified as a so-called similarity score or similarity coefficient. Several different methods can be used to compute a similarity score, but the underlying idea is the same: if the two fingerprints have 1’s at the same position, both compounds have the same fragment, and the more fragments two molecules share, the more similar they are considered to be. In conjunction with the PubChem subgraph fingerprints, the PubChem 2-D similarity method uses the Tanimoto coefficient

$$ \mathrm{Tanimoto} = \frac{N_{AB}}{N_A + N_B - N_{AB}} $$

where N_A and N_B are the numbers of bits set in the fingerprints for molecules A and B, respectively, and N_AB is the number of bits set in both fingerprints. The Tanimoto score ranges from 0 (for no similarity) to 1 (for identical fingerprints). A 2-D similarity search returns molecules whose similarity scores with the query molecule are greater than or equal to a given Tanimoto cut-off value.
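As a rough illustration of the scoring described above, the following Python sketch computes the Tanimoto coefficient, N_AB / (N_A + N_B - N_AB), for two fingerprints written as strings of 0's and 1's, and then filters a small library by a cut-off. The function names, the toy four-bit fingerprints, and the convention of returning 0.0 when neither fingerprint has any bit set are assumptions made for this example; they are not part of PubChem.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto coefficient of two equal-length bit strings ('0'/'1')."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    n_a = fp_a.count("1")   # bits set in A
    n_b = fp_b.count("1")   # bits set in B
    # bits set in both A and B
    n_ab = sum(1 for a, b in zip(fp_a, fp_b) if a == "1" and b == "1")
    union = n_a + n_b - n_ab
    # Assumed convention: two all-zero fingerprints score 0.0
    return n_ab / union if union else 0.0


def similarity_search(query: str, library: dict, cutoff: float) -> list:
    """Names of library entries whose score with the query is >= cutoff."""
    return [name for name, fp in library.items()
            if tanimoto(query, fp) >= cutoff]


if __name__ == "__main__":
    # Toy four-bit fingerprints, purely hypothetical
    library = {"mol1": "1101", "mol2": "1000", "mol3": "0010"}
    print(similarity_search("1100", library, cutoff=0.5))
```

Real fingerprints are much longer (881 bits for PubChem), but the arithmetic is the same.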


Comments 7

Multiple Fragments

Dr. Kim, Does the Tanimoto coefficient take into account multiple copies of a particular fragment? What happens if you compare a molecule that has four benzene rings with a molecule that has only one benzene ring?

The fragment counts can be encoded into the fingerprints.

Although there may be several approaches, I will explain what is being used in PubChem. In PubChem’s 2-D similarity method, the number of a particular fragment is encoded into the fingerprint definition. As an example, please check the PubChem fingerprint definition available at: <a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt">ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt</a>

The first four bit positions in this definition are:

===
Section 1: Hierarchic Element Counts - These bits test for the presence or count of individual chemical atoms represented by their atomic symbol.

Bit Position    Bit Substructure
0               >= 4 H
1               >= 8 H
2               >= 16 H
3               >= 32 H
===

According to this definition, the first four bits of a compound with 40 hydrogen atoms will be “1111”, and those of a compound with 10 hydrogen atoms will be “1100”. In this way, you can distinguish compounds with a particular fragment from compounds with a different number of the same fragment.

It should be noted that you may choose a different fingerprint definition. For example, rather than using >= 4, >= 8, >= 16, >= 32 hydrogens for the first four bits, you may choose >= 8, >= 16, >= 24, >= 32, >= 40 hydrogens for the first five bits (note that there is one additional bit for >= 40 hydrogens). Of course, you can choose *not* to make any distinction between different hydrogen atom counts. With that said, it should be noted that different groups have devised different fingerprint definitions to meet their own needs, and the PubChem fingerprint is only one of them.

To answer your original question about the number of benzene ring fragments, please check the fragment definition for bit positions at 178-212 (from the PubChem fingerprint definition file through the link above).

179    >= 1 saturated or aromatic carbon-only ring size 6
...
186    >= 2 saturated or aromatic carbon-only ring size 6
...
193    >= 3 saturated or aromatic carbon-only ring size 6
...
200    >= 4 saturated or aromatic carbon-only ring size 6
...
207    >= 5 saturated or aromatic carbon-only ring size 6

According to this definition, compounds with four benzene rings will have a bit sequence of “1...1...1...1...0” and compounds with one benzene ring will have a bit sequence of “1...0...0...0...0”.
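To make the counting scheme concrete, here is a small Python sketch of how ">= N" threshold bits could be produced from a fragment count. The helper name `threshold_bits` and the constants are made up for this illustration; only the threshold values themselves come from the definition file quoted above.

```python
def threshold_bits(count: int, thresholds) -> str:
    """One bit per threshold: '1' if count >= threshold, else '0'."""
    return "".join("1" if count >= t else "0" for t in thresholds)


# Thresholds quoted from the PubChem definition file
H_THRESHOLDS = (4, 8, 16, 32)          # bits 0-3: >= 4, 8, 16, 32 H
RING6_THRESHOLDS = (1, 2, 3, 4, 5)     # bits 179, 186, 193, 200, 207

if __name__ == "__main__":
    print(threshold_bits(40, H_THRESHOLDS))       # 40 hydrogens
    print(threshold_bits(10, H_THRESHOLDS))       # 10 hydrogens
    print(threshold_bits(4, RING6_THRESHOLDS))    # four six-membered carbon rings
    print(threshold_bits(1, RING6_THRESHOLDS))    # one six-membered carbon ring
```

Looking only at these five ring bits, the four-ring compound sets four bits, the one-ring compound sets one, and they share one, so these bits alone would contribute a Tanimoto score of 1 / (4 + 1 - 1) = 0.25. The actual score depends on all 881 bits.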

Answers to Student's questions that disappeared.

It seems that somebody posted questions to this page, but somehow they disappeared (probably due to some technical issues?). I’m re-posting the questions with my answers.

>>>> (1) How do you access a molecule's fingerprint from PubChem? I can do similarity searches, but I cannot find specific molecular fingerprints.

You can find it in a 2-D structure file downloaded from the molecule’s summary page. Visit this page (for CID2244, aspirin) and click the download button on the top-left corner. You will see several download options in the menu. Click the “Display” button for the SDF file format under the “2D Structure” section. It will display the relevant data in the web browser. The fingerprint appears under the PUBCHEM_CACTVS_SUBSKEYS section.

> "PUBCHEM_CACTVS_SUBSKEYS" ! The brackets have been changed to quotes to avoid conflicts with HTML.
AAADccBwOAAAAAAAAAAAAAAAAAAAAAAAAAAwAAAAAAAAAAABAAAAGgAACAAADASAmAAyDoAABgCIAiDSCAACCAAkIAAIiAEGCMgMJzaENRqCe2Cl4BEIuYeIyCCOAAAAAAAIAAAAAAAAABAAAAAAAAAAAA

Note that the fingerprint in the file is stored in base-64 encoding, meaning that you will need to convert it to a binary format, as explained in the PubChem fingerprint description file (<a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt">ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt</a>).

==== Beginning of excerpt from the file ====
When exporting fingerprint information in the SD file format, the SD tag for the PubChem Substructure Fingerprint property is "PUBCHEM_CACTVS_SUBGRAPHKEYS". The PubChem Substructure Fingerprint is Base64 encoded to provide a textual representation of the binary data. For a description of the Base64 encoding and decoding algorithm specification, go to: <a href="http://www.faqs.org/rfcs/rfc3548.html">http://www.faqs.org/rfcs/rfc3548.html</a>
==== End of excerpt from the file ====

You can download these data in different file formats, from individual compound summary pages or from the download facility (<a href="https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi">https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi</a>). You can also download them from the PubChem FTP site (<a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem">ftp://ftp.ncbi.nlm.nih.gov/pubchem</a>). It is also possible to get them programmatically. For example, you can use a PUG-REST request like: <a href="https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/record/SDF/?record_type=2d&response_type=display">https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/record/SDF/?record_type=2d&response_type=display</a> More detailed information on PUG-REST can be found in this paper (<a href="http://nar.oxfordjournals.org/content/43/W1/W605.long">http://nar.oxfordjournals.org/content/43/W1/W605.long</a>). Please see Figure 2 of this paper to get some idea about how to formulate PUG-REST requests programmatically.

>>>> (2) Is it possible to identify functional groups for algorithmically created chemical identifiers by transforming them to molecular fingerprints (and analyzing the fingerprints instead of going through mol files)?

It is not clear what you are asking exactly, but my guess is that you are asking whether you can figure out if a compound has a particular functional group by checking the compound’s fingerprint. In theory, it is possible, if your fingerprint definition includes a fragment corresponding to the functional group of interest. However, in practice, you can do the same thing simply by searching the database using a SMARTS string as a query.
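For readers who want to try the programmatic route, the Python sketch below builds the PUG-REST URL shown above for a given CID and pulls one data item out of SDF text. The function names are hypothetical, and the parser is deliberately minimal: it assumes the usual SD layout in which a `> <TAG>` header line is followed by the value and then a blank line. The network call is left as a comment so that the block runs offline.

```python
from typing import Optional


def pug_rest_sdf_url(cid: int) -> str:
    """2-D SDF record URL for a compound, following the request shown above."""
    return ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
            f"{cid}/record/SDF/?record_type=2d&response_type=display")


def extract_sd_field(sdf_text: str, tag: str) -> Optional[str]:
    """Value of the first '> <tag>' data item in SDF text, or None if absent."""
    lines = sdf_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(">") and f"<{tag}>" in line:
            value = []
            for nxt in lines[i + 1:]:
                if not nxt.strip():       # a blank line ends the data item
                    break
                value.append(nxt.strip())
            # Multi-line values are joined without separators
            # (an assumption that suits a Base64 string)
            return "".join(value)
    return None


if __name__ == "__main__":
    print(pug_rest_sdf_url(2244))
    # To fetch the record (requires network access):
    #   from urllib.request import urlopen
    #   sdf_text = urlopen(pug_rest_sdf_url(2244)).read().decode("utf-8")
    #   print(extract_sd_field(sdf_text, "PUBCHEM_CACTVS_SUBSKEYS"))
```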

finger prints and identifying functional groups

Let me follow up on this student's question. One of the suggested projects was to figure out how to determine the functional groups within a molecule from its InChI. My understanding is that the InChI canonicalization algorithm does not use functional groups, so that is not directly possible, but that an InChI could be converted to a mol file, or some other format, and then it could be done. This led to the question of whether a fingerprint could be used to identify functional groups, and if so, could you make a fingerprint out of any form of chemical identifier (InChI, SMILES, ...) and know the functional groups? Instead of using a Tanimoto coefficient with the fingerprint to determine similarity, could you simply have a yes/no answer, and how many? Could you make some sort of comparison matrix, where you compare the identifier to a series of fake molecules, where each fake molecule represents a specific functional group? If you get a 1, it is there; if you get a zero, it is not. (I am not clear on how you count them.) Cheers, Bob

Fingerprints can be used for that purpose, but......

If you know the structure of a molecule (in the form of xyz coordinates, molecular graph, etc.), you can generate a fingerprint from it. Therefore, if you have correct/valid structure-based chemical identifiers (e.g., InChI, SMILES, and so on) that have information on molecular structures, you can generate molecular fingerprints from them. Molecular fingerprints represent the presence or absence of a particular functional group in a molecule, and one can use fingerprints to encode the number of a particular functional group. Therefore, they can be used to identify functional groups in molecules and how many, using the approach that you suggested in your post. However, in essence, what you are trying to do is “substructure” search, which is discussed in Section 2.4 of Module 6 (<a href="http://olcc.ccce.divched.org/2015OLCCModule6P1TLO-2-4">http://olcc.ccce.divched.org/2015OLCCModule6P1TLO-2-4</a>). What you are trying to do can also be done by substructure search with a SMARTS string (that represents a functional group) as a query. (Of course, you will need to convert InChI to SMILES before the substructure search).
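As a sketch of the yes/no and "how many" lookups discussed in this thread, the Python snippet below reads the ring-count bits quoted earlier in the comments (positions 179, 186, 193, 200 and 207) from an 881-character bit string. The helper names and the test fingerprint are invented for illustration, and the sketch assumes that bit positions are zero-based indices into the string, as the definition file's numbering from 0 suggests.

```python
# Bit position -> minimum number of saturated or aromatic carbon-only
# six-membered rings (positions quoted from the PubChem definition file)
RING6_BITS = {179: 1, 186: 2, 193: 3, 200: 4, 207: 5}


def has_bit(fp: str, position: int) -> bool:
    """Yes/no answer: is the fragment at this bit position present?"""
    return fp[position] == "1"


def min_ring6_count(fp: str) -> int:
    """Largest '>= N' ring threshold whose bit is set (0 if none is set)."""
    return max((n for pos, n in RING6_BITS.items() if has_bit(fp, pos)),
               default=0)


if __name__ == "__main__":
    # Hypothetical fingerprint with only bits 179 and 186 switched on
    bits = ["0"] * 881
    for pos in (179, 186):
        bits[pos] = "1"
    fp = "".join(bits)
    print(has_bit(fp, 179), min_ring6_count(fp))
```

The count obtained this way is only a lower bound once the top threshold is reached: a compound with seven such rings sets the same bits as one with five.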

Conversion Question

I'm confused as to how you convert a base-64 integer into binary format so that you could use the PubChem Substructure Fingerprint Description? I see the base-64 alphabet, but I'm not sure what those numbers mean. Are they the bit positions?

Please see the additional files at the bottom of this page.

I’ve posted two files: (1) a figure that shows the concept of conversion between base-2 (binary) and base-64 numbers, and (2) a code snippet that shows how the decoding process is implemented. These files appear at the bottom of this page (<a href="http://olcc.ccce.divched.org/2015OLCCModule6P1TLO-2-5-1">http://olcc.ccce.divched.org/2015OLCCModule6P1TLO-2-5-1</a>).

In essence, the binary form of a PubChem fingerprint is an 881-character-long sequence of 0’s and 1’s. Because it is too long, it is stored in base-64 encoding, in which each character stands for six bits. For example, “A” is used for the bit string “000000”, “B” for “000001”, “C” for “000010”, “D” for “000011”, and so on. This idea is illustrated in the “fingerprint_decoding.gif” file. While this course does not require students to learn a programming language or write code, the code in “fingerprint_decoding.txt” gives you some idea about how the decoding/encoding can be done.

In addition, please refer to the bottom of the PubChem fingerprint definition file (<a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt">ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt</a>).

====
Decoding PubChem Fingerprints

PubChem fingerprints are currently 881 bits in length. Binary data is stored in one byte increments. The fingerprint is, therefore, 111 bytes in length (888 bits), which includes padding of seven bits at the end to complete the last byte. A four-byte prefix, containing the bit length of the fingerprint (881 bits), increases the stored PubChem fingerprint size to 115 bytes (920 bits).

When PubChem fingerprints are encoded in base64 format, the base64-encoded fingerprints are 156 bytes in length. The last two bytes are padding so that the base64 length is divisible by four (156 bytes - 2 bytes = 154 bytes). Each base64 byte encodes six binary bits (154 bytes * 6 bits/byte = 924 bits). The last four bits are padding to complete the last base64 byte (924 bits - 4 bits = 920 bits). The resulting 920 binary bits (115 bytes) are described in the previous paragraph.
====
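The decoding steps in the excerpt can be sketched in Python with the standard `base64` module. This is not the posted “fingerprint_decoding.txt” code, only a minimal sketch under two assumptions: that the four-byte prefix stores the bit length as a big-endian integer, and that bit position 0 is the most significant bit of the first byte after the prefix. Both are consistent with the leading characters of the aspirin string quoted earlier in the comments (“AAADcc” decodes to a prefix of 881), but they are inferred here rather than stated in the excerpt. The function name is hypothetical.

```python
import base64

PREFIX_BYTES = 4  # four-byte prefix holding the fingerprint bit length


def decode_pubchem_fingerprint(b64_text: str) -> str:
    """Decode a Base64 fingerprint into a string of '0'/'1' characters."""
    # Restore '=' padding if it was stripped, so the length is divisible by 4
    padded = b64_text + "=" * (-len(b64_text) % 4)
    raw = base64.b64decode(padded)
    # Assumption: the prefix is a big-endian integer (881 for PubChem)
    n_bits = int.from_bytes(raw[:PREFIX_BYTES], "big")
    # Assumption: most significant bit first within each byte
    bits = "".join(f"{byte:08b}" for byte in raw[PREFIX_BYTES:])
    return bits[:n_bits]   # drop the padding bits at the end


if __name__ == "__main__":
    # Synthetic 115-byte record: prefix 881, then 111 bytes with bits 0 and 1 set
    raw = (881).to_bytes(4, "big") + bytes([0b11000000]) + bytes(110)
    encoded = base64.b64encode(raw).decode("ascii")
    print(len(encoded))                                  # base64 length with padding
    print(len(decode_pubchem_fingerprint(encoded)))      # number of fingerprint bits
    print(decode_pubchem_fingerprint(encoded)[:4])       # first four bits
```

Under these assumptions, the first characters of the aspirin string, “AAADccBw”, decode to a fingerprint that begins “1100”, which matches the >= 4 H and >= 8 H bits being set for a compound with eight hydrogen atoms.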
