1.1. PubChem: chemical information repository at the U.S. NIH

        PubChem (https://pubchem.ncbi.nlm.nih.gov)1-5 is a public repository of information on small molecules and their biological activities, developed and maintained by the National Library of Medicine (NLM), an institute within the U.S. National Institutes of Health (NIH).  Since its launch in 2004 as a component of the NIH’s Molecular Libraries Roadmap Initiatives, it has grown rapidly and now serves as a key chemical information resource for researchers in many areas of biomedical science, including cheminformatics, chemical biology, and medicinal chemistry.  Detailed information on PubChem can be found in recent papers published in Nucleic Acids Research (https://doi.org/10.1093/nar/gkv951; https://doi.org/10.1093/nar/gkt978).

        PubChem is a data aggregator, meaning that it collects data from many different sources.  Currently, PubChem’s data come from more than 350 organizations, including government agencies, university labs, pharmaceutical companies, substance vendors, and other databases.  PubChem organizes its data into three primary databases: Substance, Compound, and BioAssay.  Individual data contributors deposit information on chemical substances to the Substance database (https://www.ncbi.nlm.nih.gov/pcsubstance).  Different data contributors may provide information on the same molecule, so the same chemical structure may appear multiple times in the Substance database.  To provide a non-redundant view, chemical structures in the Substance database are normalized through a process called “standardization,” and the resulting unique chemical structures are stored in the Compound database (https://www.ncbi.nlm.nih.gov/pccompound).  The difference between the Substance and Compound databases is explained in more detail in this blog post.  Descriptions of biological experiments on chemical substances are stored in the BioAssay database (https://www.ncbi.nlm.nih.gov/pcassay).  Records in the Substance, Compound, and BioAssay databases are located by database identifiers called SID (Substance ID), CID (Compound ID), and AID (Assay ID), respectively.
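
        The relationship between the three identifier types, and between redundant substances and unique compounds, can be illustrated with a minimal Python sketch.  Several things here are assumptions rather than part of the text above: the record-page address layout (https://pubchem.ncbi.nlm.nih.gov/compound/<CID>, and likewise for substance and bioassay), the helper names record_url and group_by_structure, and the SIDs in the sample data, which are made up.  The sketch does not implement PubChem’s standardization; it assumes each deposited record already carries a standardized structure string and only shows how duplicates collapse into unique structures.

```python
from collections import defaultdict

BASE_URL = "https://pubchem.ncbi.nlm.nih.gov"

# Identifier type -> path segment of the record page (assumed URL layout).
RECORD_PATHS = {"SID": "substance", "CID": "compound", "AID": "bioassay"}


def record_url(id_type: str, record_id: int) -> str:
    """Build the web address of a PubChem record page from its identifier."""
    try:
        path = RECORD_PATHS[id_type.upper()]
    except KeyError:
        raise ValueError(f"unknown identifier type: {id_type!r}") from None
    if record_id <= 0:
        raise ValueError("PubChem identifiers are positive integers")
    return f"{BASE_URL}/{path}/{record_id}"


def group_by_structure(depositions):
    """Group depositor records (SIDs) that share one standardized structure.

    Each item is (sid, structure), where `structure` stands in for the
    output of standardization, which is NOT implemented here.
    """
    groups = defaultdict(list)
    for sid, structure in depositions:
        groups[structure].append(sid)
    return dict(groups)


# Hypothetical SIDs: two depositors submitted the same structure (aspirin),
# a third submitted a different one (ethanol).
depositions = [
    (101, "CC(=O)OC1=CC=CC=C1C(=O)O"),
    (102, "CCO"),
    (103, "CC(=O)OC1=CC=CC=C1C(=O)O"),
]

unique_structures = group_by_structure(depositions)
print(record_url("CID", 2244))
print(len(depositions), "substances ->", len(unique_structures), "unique structures")
```

        In this toy data set, three substance records reduce to two unique structures, which mirrors why the Compound database is smaller than the Substance database.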

        PubChem contains more than 157 million depositor-provided substances, 60 million unique chemical structures, and one million biological assays, which cover about 10 thousand protein target sequences.  For efficient use of this vast amount of data, PubChem provides various search and analysis tools.  Some of these search tools will be used later in this module for demonstration purposes.


Comments 3

Daniel Graham (not verified) | Mon, 10/26/2015 - 22:50
Hey! Does anyone have a good resource on standardization? It seems like an interesting problem to solve.

Evan Hepler-Smith | Tue, 10/27/2015 - 06:37
Hi Daniel, The general term for putting equivalent representations of data into a standard form is "canonicalization." Wikipedia is one starting point: https://en.wikipedia.org/wiki/Canonicalization Putting on my historian-of-science hat, an early and influential method for canonicalizing chemical structures is called the "Morgan algorithm" (it's named for the CAS scientist who came up with it in the early 60s). Here's a short blog post with bread crumbs to other resources: https://graphiteworks.wordpress.com/2011/08/31/chemoinformatics-curiosities-i-the-morgan-algorithm/ And here is Morgan's original paper: http://dx.doi.org/10.1021/c160017a018 (J. Chem. Doc. vol. 5, no. 2 (1965): 107-113).

Sunghwan Kim | Tue, 10/27/2015 - 08:51
Different databases use different standardization methods that suit their own needs and standards. That said, PubChem’s standardization method is briefly explained on pages 4-5 of the PDF version of this paper: https://doi.org/10.1093/nar/gkv951