The answer depends on what kind of search you are doing.
(1) For identity search or similarity search, you should provide a specific chemical structure as a query to find structures that are identical or similar to the query. If you have multiple compound queries to search for, you will need to perform independent identity/similarity searches. In this case, you will need to do it programmatically.
(2) For substructure/superstructure search, you can provide a “pattern” of chemical structures (e.g., structures containing a benzene ring with halogen substituents), provided the database search system supports a representation for generic structures (i.e., patterns), such as SMARTS.
Go to Chemical Structure Search (<a href="https://pubchem.ncbi.nlm.nih.gov/search/search.cgi">https://pubchem.ncbi.nlm.nih.gov/search/search.cgi</a>) and do a *substructure* search using each of these queries:
a. SMILES string query: c1(c(c(c(c(c1F)N)F)N)F)N
b. SMARTS string query: c1(c(c(c(c(c1[F,Cl,Br,I])N)[F,Cl,Br,I])N)[F,Cl,Br,I])N
The second one looks somewhat complicated, but in essence, all F atoms are replaced with [F,Cl,Br,I], meaning that each of those atoms can be any of the four halogens. Note that the second SMARTS query matches everything the first SMILES query does, plus molecules with heavier halogens. One caveat of substructure search with a generic query is that it often takes too long, and the search will fail when it hits the time limit (30 seconds for substructure search, I believe). If the query is too generic (e.g., find any compound with a hydroxyl group), it will return too many hits (often millions).
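To make the relationship between the two queries concrete, here is a minimal Python sketch. It builds the SMARTS pattern from the SMILES query by plain string replacement, and it also lists every concrete SMILES string the pattern stands for. The function names are made up for illustration, and the naive text replacement only works here because the letter "F" appears in the query solely as a fluorine atom.

```python
from itertools import product

HALOGENS = ["F", "Cl", "Br", "I"]
SMILES_QUERY = "c1(c(c(c(c(c1F)N)F)N)F)N"


def to_halogen_smarts(smiles, atom="F"):
    """Replace every occurrence of `atom` with the atom list [F,Cl,Br,I].

    Naive text replacement: safe for this query, but it would also hit
    two-letter symbols that contain the same letter (e.g. "Fe").
    """
    return smiles.replace(atom, "[" + ",".join(HALOGENS) + "]")


def enumerate_halogen_smiles(smiles, atom="F"):
    """List every SMILES string obtained by putting F, Cl, Br or I at each site."""
    parts = smiles.split(atom)      # text between the halogen positions
    n_sites = len(parts) - 1
    results = []
    for combo in product(HALOGENS, repeat=n_sites):
        pieces = [parts[0]]
        for halogen, tail in zip(combo, parts[1:]):
            pieces.append(halogen + tail)
        results.append("".join(pieces))
    return results


smarts = to_halogen_smarts(SMILES_QUERY)
variants = enumerate_halogen_smiles(SMILES_QUERY)
print(smarts)
print(len(variants), "SMILES strings covered by the single SMARTS query")
```

With three halogen positions and four choices each, the list has 4 x 4 x 4 = 64 strings. Some of them describe the same molecule written in a different order because the ring is symmetric, so the number of distinct compounds is smaller. The point is that one SMARTS query stands in for all of these separate searches.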
If you have multiple query molecules that cannot be represented using a single SMARTS string (meaning they do not share a common structural characteristic or pattern), or if the database does not support SMARTS as a query language, you will need to do the searches programmatically.
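As a rough idea of what "programmatically" can look like, the sketch below builds one request URL per query molecule. It assumes PubChem's PUG REST interface and its `fastidentity` and `fastsimilarity_2d` operations, which are not described in this thread, so please check the current PUG REST documentation before relying on the exact URL layout. The two query SMILES are placeholders, and the helper names are made up for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_search_url(smiles, search="fastidentity", threshold=None):
    """Build a URL that asks for the CIDs matching one SMILES query.

    `search` is assumed to be "fastidentity" or "fastsimilarity_2d";
    `threshold` (percent) is only meaningful for similarity search.
    """
    encoded = quote(smiles, safe="")  # escape characters such as '(' and '='
    url = f"{PUG_REST}/compound/{search}/smiles/{encoded}/cids/TXT"
    if threshold is not None:
        url += f"?Threshold={threshold}"
    return url


def fetch_cids(url, timeout=30):
    """Download the plain-text CID list (one CID per line) for one query."""
    with urlopen(url, timeout=timeout) as response:
        return [int(line) for line in response.read().decode().split()]


# Placeholder queries; replace them with your own list of structures.
queries = ["c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
urls = [build_search_url(smiles) for smiles in queries]
for url in urls:
    print(url)
```

Calling `fetch_cids(url)` on each URL in a loop would run the independent searches one after another. If you do this for many queries, pause between requests so that you stay within PubChem's usage limits.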
Is there a way to avoid doing separate individual searches, and instead perform a single search that returns all applicable results containing the designated halogens?
This project is designed to assist forensic chemists, who are required to analyze very dilute samples. Sometimes those tedious extractions can be avoided by exploiting the differing solubilities of compounds being extracted in different solvents. This solubility information is not very easy to locate and is not always trustworthy. This project will allow for solubility information to be more easily accessible and provide information such as which compounds are soluble in which solvents and to what degree. The information required for this project will be procured by combining data on solubility and solvents from publicly available databases. The objective is to create a spreadsheet that connects solutes to online databases and assists the user in finding appropriate solvents. If at all possible, we would like to enable green chemistry metrics within this function.
I just learned about an openly available solubility database that should be helpful for this project. Here's the URL: <a href="http://srdata.nist.gov/solubility/">http://srdata.nist.gov/solubility/</a>
Different databases use different standardization methods that suit their own needs and standards. With that said, PubChem’s standardization method is briefly explained on pages 4-5 of the PDF version of this paper:
<a href="https://doi.org/10.1093/nar/gkv951">https://doi.org/10.1093/nar/gkv951</a>
Hi Daniel,
The general term for putting equivalent representations of data into a standard form is "canonicalization." Wikipedia is one starting point: <a href="https://en.wikipedia.org/wiki/Canonicalization">https://en.wikipedia.org/wiki/Canonicalization</a>
Putting on my historian-of-science hat, an early and influential method for canonicalizing chemical structures is called the "Morgan algorithm" (it's named for the CAS scientist who came up with it in the early 60s). Here's a short blog post with bread crumbs to other resources: <a href="https://graphiteworks.wordpress.com/2011/08/31/chemoinformatics-curiosities-i-the-morgan-algorithm/">https://graphiteworks.wordpress.com/2011/08/31/chemoinformatics-curiosities-i-the-morgan-algorithm/</a>
And here is Morgan's original paper: <a href="http://dx.doi.org/10.1021/c160017a018">http://dx.doi.org/10.1021/c160017a018</a> (J. Chem. Doc. vol. 5, no. 2 (1965): 107-113).
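For the curious, here is a simplified Python sketch of the "extended connectivity" idea at the heart of the Morgan algorithm. Each atom starts with its number of neighbors, and the values are repeatedly replaced by the sum of the neighbors' values until the number of distinct values stops growing. This covers only that ranking step, not the full numbering procedure in Morgan's paper, and the function name and the way the molecule is written as a neighbor list are my own choices for illustration.

```python
def morgan_connectivity(adjacency):
    """Return extended-connectivity values for a molecular graph.

    `adjacency` maps each atom label to the list of its neighbors
    (hydrogens left out).
    """
    # Start from the plain connectivity: the number of neighbors.
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        # New value of an atom = sum of the current values of its neighbors.
        extended = {
            atom: sum(values[n] for n in nbrs)
            for atom, nbrs in adjacency.items()
        }
        n_extended = len(set(extended.values()))
        if n_extended <= n_classes:
            # No further discrimination gained: keep the previous values.
            return values
        values, n_classes = extended, n_extended


# Carbon skeleton of n-pentane, atoms numbered along the chain.
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
# The same molecule with the atoms numbered in a different order.
pentane_renumbered = {3: [0], 0: [3, 4], 4: [0, 1], 1: [4, 2], 2: [1]}

print(morgan_connectivity(pentane))
print(morgan_connectivity(pentane_renumbered))
```

The central carbon gets the highest value whichever way the atoms were numbered on input, which is what makes these values useful as a starting point for a canonical numbering.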
I’ve posted two files: (1) a figure that shows the concept of conversion between base-2 (binary) and base-64 numbers, and (2) a code snippet that shows how the decoding process is implemented. These files appear at the bottom of this page (<a href="http://olcc.ccce.divched.org/2015OLCCModule6P1TLO-2-5-1">http://olcc.ccce.divched.org/2015OLCCModule6P1TLO-2-5-1</a>).
In essence, the binary form of a PubChem fingerprint is an 881-character-long sequence of 0’s and 1’s. Because that is too long, it is encoded in base-64, where each character stands for six bits. For example, “A” is used for the bit string “000000”, “B” for “000001”, “C” for “000010”, “D” for “000011”, and so on. This idea is illustrated in the “fingerprint_decoding.gif” file.
While this course does not require students to learn a programming language or write code, the code in “fingerprint_decoding.txt” gives you some idea of how the decoding/encoding can be done.
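The character-to-bits mapping can be sketched in a few lines of Python. This assumes the standard base-64 alphabet (A-Z, a-z, 0-9, "+", "/"), in which the number attached to each character is simply its position in the alphabet, from 0 to 63, written as six binary digits. It is not the code from the posted file, and the function name is made up.

```python
import string

# Standard base-64 alphabet: position 0 is "A", position 63 is "/".
B64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)


def char_to_bits(ch):
    """Return the six bits that one base-64 character stands for."""
    return format(B64_ALPHABET.index(ch), "06b")


for ch in "ABCD":
    print(ch, char_to_bits(ch))
```

So the numbers in the base-64 alphabet table are values, not bit positions. Each character contributes six consecutive bits to the decoded string.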
In addition, please refer to the bottom of the PubChem fingerprint definition file (<a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt">ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt</a>).
====
Decoding PubChem Fingerprints
PubChem fingerprints are currently 881 bits in length. Binary data is stored in one byte increments. The fingerprint is, therefore, 111 bytes in length (888 bits), which includes padding of seven bits at the end to complete the last byte. A four-byte prefix, containing the bit length of the fingerprint (881 bits), increases the stored PubChem fingerprint size to 115 bytes (920 bits).
When PubChem fingerprints are encoded in base64 format, the base64-encoded fingerprints are 156 bytes in length. The last two bytes are padding so that the base64 length is divisible by four (156 bytes - 2 bytes = 154 bytes). Each base64 byte encodes six binary bits (154 bytes * 6 bits/byte = 924 bits). The last four bits are padding to complete the last base64 byte (924 bits - 4 bits = 920 bits). The resulting 920 binary bits (115 bytes) are described in the previous paragraph.
====
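The arithmetic in the quoted description can be checked with a short Python sketch that uses the standard `base64` module. It assumes that the four-byte prefix stores the bit length as a big-endian integer and that bits are read most-significant-bit first within each byte. The fingerprint used here is synthetic (only the first and last bits are set), not one downloaded from PubChem, and the function names are made up for illustration.

```python
import base64

FP_BITS = 881       # fingerprint length stated in the specification
PREFIX_BITS = 32    # four-byte length prefix


def decode_pubchem_fingerprint(fp_base64):
    """Return (declared bit length, 881-character string of 0s and 1s)."""
    raw = base64.b64decode(fp_base64)          # 115 bytes
    declared = int.from_bytes(raw[:4], "big")  # assumed big-endian prefix
    bits = "".join(format(byte, "08b") for byte in raw)
    # Skip the prefix and drop the seven padding bits at the end.
    return declared, bits[PREFIX_BITS:PREFIX_BITS + FP_BITS]


def encode_pubchem_fingerprint(bitstring):
    """Inverse operation, used here only to build a synthetic test case."""
    padded = bitstring + "0" * (-len(bitstring) % 8)   # 881 -> 888 bits
    body = int(padded, 2).to_bytes(len(padded) // 8, "big")
    prefix = len(bitstring).to_bytes(4, "big")
    return base64.b64encode(prefix + body).decode("ascii")


# Synthetic fingerprint: bit 0 and bit 880 are on, everything else is off.
example_bits = "1" + "0" * 879 + "1"
encoded = encode_pubchem_fingerprint(example_bits)
declared, decoded = decode_pubchem_fingerprint(encoded)
print(len(encoded), declared, decoded == example_bits)
```

The encoded string is 156 characters long and ends in two "=" padding characters, matching the numbers above. Position i in the decoded string (counting from 0) then corresponds to bit i in the fingerprint definition file.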
I'm confused as to how you convert a base-64 integer into binary format so that you could use the PubChem Substructure Fingerprint Description? I see the base-64 alphabet, but I'm not sure what those numbers mean. Are they the bit positions?
Actually, it is highly recommended to keep records of BOTH your search strategy AND your search results. Chemical structure search is complex in itself, and on top of that the contents of public chemical databases keep changing, which can change your search results. This often makes things very complicated when you need to re-run a search months later in the middle of a year-long project. Therefore, you need to save your search strategy (query, options/filters, and other limits) as well as your search results. I strongly suggest that you keep both in text files. When you use PubChem Structure Search, you can save your query and options to a file so that you can re-run the search later with the same query and options, simply by importing that file. You can find the “save” icon at the bottom of the PubChem Chemical Structure Search page under any tab except name/text search. The saved file can be imported by using the right-most tab in the PubChem Chemical Structure Search tool.
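If you run your searches programmatically, where the “save” icon is not available, the same habit can be kept with a few lines of Python. The sketch below writes the query, the options, the date of the search, and the hits into one JSON text file. The field names, the file name, and the example CIDs are placeholders that I made up for illustration.

```python
import json
from datetime import datetime, timezone


def save_search_record(path, query, options, results):
    """Write the search strategy and the search results to one text file."""
    record = {
        "run_at": datetime.now(timezone.utc).isoformat(),  # when it was run
        "query": query,                                     # strategy
        "options": options,                                 # filters/limits
        "result_count": len(results),
        "results": results,                                 # e.g. list of CIDs
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
    return record


# Example call with placeholder values:
# save_search_record(
#     "halogen_search.json",
#     query="c1(c(c(c(c(c1[F,Cl,Br,I])N)[F,Cl,Br,I])N)[F,Cl,Br,I])N",
#     options={"search_type": "substructure"},
#     results=[111, 222, 333],
# )
```

Because the time of the search is stored next to the hits, a later re-run that returns a different number of compounds can be traced to changes in the database and not to a change in the query.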
Two potential solutions depending on the types of search.
Is there a way to avoid doing
Abstract
IUPAC-NIST Solubility Database
PubChem's standardization method
Canonicalization
More detail
Please see the additional files at the bottom of this page.
Conversion Question
Keep BOTH search strategy AND search results