Discussion | DivCHED CCCE: Cheminformatics OLCC

See my comment below

I thought I replied to your comment, but it seems I didn't. See my comment "It's actually simple" below.

It's actually simple.

The primary issue is that the rows and columns in the two files are not sorted in the same way. Please follow these steps.

[1] For 2-D similarity score matrix.

(1a) Select all data and sort the data by the first column (that is, CIDs) in ascending (increasing) order. Now your *columns* are sorted by CID.

(1b) Transpose the whole matrix (i.e., switch the rows and columns). To do this, select the whole matrix and copy it to the clipboard (Ctrl+C). Then, paste it using the "paste special"->"transpose" option. Because you transposed the columns/rows, now your *rows* are sorted by CID.

(1c) Sort the columns by CID in ascending order [as you did in (1a)]. Now both the columns and rows of the 2-D score matrix are sorted by CID.

[2] Sort the 3-D score matrix in the same way you did in [1].

[3] Now, both the 2-D and 3-D score matrices are sorted in the same way. Place the two matrices side by side by copying and pasting one of them into the other spreadsheet.

[4] Now you can compute the difference between the elements in the two matrices.

[5] Use the max() and min() functions to find the extreme differences. (It would be helpful if you compute the max. and min. values for each row and column first, then the values for the whole matrix.)
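If you would rather script this than use a spreadsheet, steps [1] to [5] can be sketched in Python as below. This is only a minimal sketch under some assumptions: the CSV layout (column CIDs in the first row, row CIDs in the first column) is assumed rather than taken from the actual downloaded files, the function names are hypothetical, and the numbers are made-up toy data. Looking up each score by its CID pair plays the role of the sorting in steps [1] and [2], so the two files may list the CIDs in different orders.

```python
import csv
import io


def read_score_matrix(text):
    """Parse a CSV score matrix into {row_cid: {col_cid: score}}.

    Assumes the first row holds the column CIDs and the first column
    holds the row CIDs (an assumption about the file layout).
    """
    rows = list(csv.reader(io.StringIO(text)))
    col_cids = [int(c) for c in rows[0][1:]]
    matrix = {}
    for row in rows[1:]:
        row_cid = int(row[0])
        matrix[row_cid] = {c: float(v) for c, v in zip(col_cids, row[1:])}
    return matrix


def extreme_differences(m2d, m3d):
    """Return the largest and smallest (2-D minus 3-D) difference,
    each as (difference, (cid_a, cid_b))."""
    cids = sorted(m2d)
    diffs = []
    for i, a in enumerate(cids):
        for b in cids[i + 1:]:  # upper triangle only: each pair once
            diffs.append((m2d[a][b] - m3d[a][b], (a, b)))
    return max(diffs), min(diffs)


# Made-up toy data; the two "files" are deliberately ordered differently.
csv_2d = """CID,1,2,3
1,100,80,60
2,80,100,70
3,60,70,100
"""
csv_3d = """CID,3,1,2
2,50,90,100
3,100,65,50
1,65,100,90
"""

largest, smallest = extreme_differences(
    read_score_matrix(csv_2d), read_score_matrix(csv_3d)
)
print("largest difference:", largest)
print("smallest difference:", smallest)
```

With the real files, you would read the file contents instead of the toy strings. Only the upper triangle is scanned because the matrices are symmetric, so each pair of compounds is counted once.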

Question on comparing 2D/3D

I downloaded the two csv files in question 5.c.iii, and have two matrices, showing all possible Tanimoto coefficients based on 2D and 3D. You then ask what combinations have the largest and smallest difference. How can you quickly do that? There are 32 columns and 32 rows, giving 1024 cells, although only half are unique. But, you have two matrices, with identical columns and rows, and want to find which cell has the largest, and smallest difference. How do you do that? Robert

PubChem Fingerprints are (Base-64) encoded binary numbers

>> Is the dendrogram the fingerprint?

No. The phrase "substructure fingerprint" within parentheses is a part of the axis title "2-D Tanimoto Similarity (substructure fingerprint)".

>> I am unfamiliar with bytes and bits.

Well, a bit is a "binary" digit (that is, it can be either 0 or 1). Two-digit binary numbers (11, 10, 00, 01) are 2 bits long, and three-digit binary numbers (111, 110, 101, ..., 000) are 3 bits long. One byte is equal to 8 bits, which means that it can be anything from 00000000 to 11111111.
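If it helps, here is a small Python sketch of the same idea. It only uses built-in functions, and the variable names are just for illustration.

```python
# An n-bit number can take 2**n different values.
for n_bits in (2, 3, 8):
    print(n_bits, "bits ->", 2 ** n_bits, "possible values")

# All 2-bit numbers, written out as binary strings.
two_bit_values = [format(i, "02b") for i in range(2 ** 2)]
print(two_bit_values)

# The largest value one byte (8 bits) can hold.
byte_max = int("11111111", 2)
print(byte_max)
```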

>> what does a fingerprint look like?

PubChem fingerprints are 881-bit-long sequences of 0's and 1's. Because such a string is too long to handle in practice, we encode it as a Base64 number (see Table 1 of this webpage: http://www.faqs.org/rfcs/rfc3548.html). Conceptually, this is very similar to converting binary numbers to hex numbers. Hex numbers use 16 digits (0~9 and A, B, C, D, E, F) to encode binary numbers. Similarly, Base64 numbers use 64 digits (listed in Table 1 of the same webpage). You can think of Base16 or Base64 encoding as just one way to "compress" fingerprints, which are binary numbers.
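To make the comparison between hex and Base64 concrete, here is a minimal Python sketch using the standard `base64` module. The 24-bit string is made up for illustration and is not a real fingerprint, and the sketch does not try to reproduce the exact layout of a PubChem fingerprint string. It only shows that the same bits get shorter as the number of available digits grows.

```python
import base64

# 24 made-up bits (3 bytes); not taken from any real fingerprint.
bits = "010011010110000101101110"

# Pack the bit string into raw bytes, most significant bit first.
raw = int(bits, 2).to_bytes(len(bits) // 8, "big")

hex_text = raw.hex().upper()                      # 4 bits per character
b64_text = base64.b64encode(raw).decode("ascii")  # 6 bits per character

print(len(bits), "binary digits:", bits)
print(len(hex_text), "hex digits:", hex_text)
print(len(b64_text), "Base64 digits:", b64_text)
```

The 24 binary digits become 6 hex digits or 4 Base64 digits, which is why the longer 881-bit fingerprint is usually shown in Base64.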

Well, I googled some introductory material on this topic, and I hope these two resources will help you understand it better.

http://computer.howstuffworks.com/bytes.htm (please go through all five pages of this web document)

https://web.njit.edu/~walsh/powers/bits.vs.bytes.html

In the last figure below on

In the last figure below on the left-hand side, right above the dendrogram, it says "(substructure fingerprint)". Is the dendrogram the fingerprint? I read that "stored PubChem fingerprint size to 115 bytes (920 bits)." (https://hyp.is/stGg0hN_EeevJEPczGq4lg) But, I am unfamiliar with bytes and bits. I am unclear on whether the fingerprint is binary or some other code. I guess what I am asking is, what does a fingerprint look like?

Tanimoto coefficient

I was just wondering how to decide when, and which, molecular similarity coefficient should be used to distinguish between two molecular fingerprints, as seen in this article: https://goo.gl/ysgR0h

No, you can't.

>> Can you use Tanimoto coefficients to identify stereoisomers?

No, you can't. The PubChem fingerprint does not contain all possible fragments in its fingerprint definition. Even if two molecules have a similarity score of 1, they may differ in the absence/presence of a fragment that is not pre-defined in the PubChem fingerprint. And there is no guarantee that this fragment has anything to do with stereochemistry.

>> is this the crux of why 2D and 3D similarity are different? (or one of the major reasons)?

Most studies that compare 2-D and 3-D methods have focused on how they differ in terms of performance (e.g., how many hits are identified, or how many hits are unique to one method or common to both), but these studies were not designed to study why they give different results. So, there is no good answer to your question. However, in my view, the way in which 2-D fingerprint-based methods encode molecular structures is completely different from the way 3-D similarity methods do, and this difference allows 3-D methods to recognize molecular similarity that 2-D methods can't. (Of course, PubChem's 2-D and 3-D similarity methods are not representative of the many existing 2-D and 3-D similarity methods, so we would end up with different conclusions if we used different methods.)

Imperfection of molecular descriptors

Any similarity method has two important components:

(1) Molecular descriptors that describe the structure of molecules and

(2) Metric that quantifies molecular similarity by comparing molecular descriptors that represent two molecules.

Ideally, isomers (whether they are stereoisomers, constitutional isomers, or enantiomers ...) are different in "some" context. So if you have a "perfect" molecular descriptor (meaning that it can describe the difference between various isomers), you would expect that these isomers would have different molecular descriptors. However, no molecular descriptor is perfect, and they often give the same set of descriptor values for different molecules. The molecular descriptor used in this homework (PubChem fingerprints) does not contain any information on stereochemistry (in other words, no bit position in the fingerprint encodes stereochemistry information). Actually, many commonly used fingerprints do not encode stereochemistry or isotopism, so the similarity comparison using these fingerprints cannot distinguish stereoisomers or isotopomers. To demonstrate the consequence of this "artifact", compute the 2-D similarity score for CIDs 10900, 643833, and 638186 (using the Score Matrix Service: https://pubchem.ncbi.nlm.nih.gov/score_matrix/score_matrix.cgi). (They are all 1,2-dichloroethene.) You will get a similarity score of 1 for all possible pairs of the three compounds.

For the Group B compounds in this homework question, some of them are isomers and others are not, because they have different substituents. However, all of them have a common structural unit (a fused four-ring system), which is a structural characteristic of steroid compounds. So, we say that these compounds have the same (molecular) scaffold. The similarity scores used for the dendrogram are available through the button "Export Similarity data" below the dendrogram. If you look at these scores, you will realize they do have very high similarity scores. (Actually, to solve question 4-c-iii, students need to download the similarity scores and analyze them.)
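The two components above can be illustrated with a short Python sketch. The descriptor here is a toy 8-bit fingerprint and the metric is the Tanimoto coefficient, that is, the number of bits set in both fingerprints divided by the number of bits set in either one. The bit strings are hypothetical and are not real PubChem fingerprints; the point is only that two molecules with identical bits get a score of 1 whether or not they are actually the same molecule.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length strings of 0's and 1's."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(a == "1" and b == "1" for a, b in zip(fp_a, fp_b))
    either = sum(a == "1" or b == "1" for a, b in zip(fp_a, fp_b))
    # Returning 1.0 when no bit is set at all is a convention chosen
    # for this sketch (an assumption), not a fixed rule.
    return both / either if either else 1.0


# Hypothetical 8-bit fingerprints, not real PubChem data.
cis_isomer = "11010010"
trans_isomer = "11010010"  # no bit encodes stereochemistry, so same bits
other_compound = "11000110"

score_isomers = tanimoto(cis_isomer, trans_isomer)
score_other = tanimoto(cis_isomer, other_compound)
print(score_isomers)
print(score_other)
```

Because the fingerprint has no bit for stereochemistry, the two isomers cannot be told apart by the score, no matter which metric is applied to these descriptors.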

It's a problem with histories cached in your browser.

If you encounter an error message saying something like "History Not Found", it typically means that there is an issue with the histories cached in your browser. In practice, you can try three things.

(1) Try a different browser (e.g., Chrome, IE, Edge, Safari, Firefox, ... any browser in which you haven't encountered the problem). This is the simplest solution, and the one I recommend.

(2) The second option is to delete all caches stored in your browser to clear out your previous Entrez histories.

(3) Entrez histories expire after some period of inactivity (I believe, 8 hours of inactivity). In other words, if you don't use any NCBI resource (through Entrez) for 8 hours, the problematic histories will expire and everything will go back to normal.

Let me know whether the issue is resolved or not.

Sunhwan,

my real question

What I was leading to is, can you use Tanimoto coefficients to identify stereoisomers? Also, is this the crux of why 2D and 3D similarity are different? (or one of the major reasons)?