You are here

4. Understanding Public Chemical Databases

Sunghwan Kim, National Center for Biotechnology Information

Learning Objectives

  • Explain what primary and secondary databases are.
  • Explain what data provenance is.
  • Review publicly available chemical databases in different domains.
  • Understand how PubChem data are organized.
  • Learn how to critically assess data in public databases.


1.    (Some) database basics


1.1. What is a database?

    A database is an “organized collection of information.”  The information in a database can be in any format, including texts, numbers, images, audios, videos, and many others (and combination of these), but this information must be “organized” for efficient retrieval.  According to this definition, a database is not necessarily electronic (i.e., accessible by computers).  For example, the collection of names in a phone book or address book may also be considered as a database, because the names are arranged (typically in alphabetical order) to make it easy to search for necessary information (e.g., phone numbers or addresses).  However, in computer science and related areas, a database usually means an electronic database.  Therefore, the term “database” in this module is used to mean an “electronic database”.


1.2. Primary databases vs. secondary databases

    Databases are often categorized into primary and secondary databases. 

  • Primary databases contain experimentally-derived data that are directly submitted by researchers (also called “primary data”).  In essence, these databases serve as archives that keep original data.  Therefore, they are also known as archival databases.
  • Secondary databases contain secondary data, which are derived from analyzing and interpreting primary data.  These databases often provide value-added information related to the primary data, by using information from other databases and scientific literature.  Essentially, secondary databases serve as reference libraries for the scientific community, providing highly curated reviews about primary data.  For this reason, they are also known as curated databases, or knowledgebase.

It should be noted that the distinction between primary and secondary databases is not always clear and that many databases have the characteristics of both primary and secondary database.  It is very common that a primary database curates its data with information drawn from secondary databases.  In addition, because many secondary databases make their value-added information available in the public domain, data exchange and integration among databases very frequently occurs.  As a results, virtually all data providers also becomes data consumers these days.


1.3. Data provenance

    The term “data provenance” refers to a record trail that describes the origin or source of a piece of data and the process by which it entered in a database.1  Simply put, data provenance deals with the questions “where the data came from” and “how and why the data is in its present place”.  Although the data provenance information is critical in the reliability of a data source (and its data), this information is not easy to manage.  In addition, information predicted in one database may not be appropriate for use in other databases, but may end up being integrated in them anyway.  Therefore, databases need to document the provenance of the data and devise a way to notify users of that information.  In turn, users should always pay attention to the data provenance issue when using a database.


2.    Public Chemical Databases

        These days many public online databases provide chemical information free of charge and the databases mentioned in this module are only a few examples of them.  Note that these databases vary in size and scope.


2.1. PubChem: chemical information repository at the U.S. NIH

        PubChem ( is a public repository of information on small molecules and their biological activities, developed and maintained by the National Library of Medicine (NLM), an institute within the U.S. National Institutes of Health (NIH).  Since its launch in 2004 as a component of the NIH’s Molecular Libraries Roadmap Initiatives, it has been rapidly growing, and now serves as a key chemical information resource for researchers in many biomedical science areas, including cheminformatics, chemical biology, and medicinal chemistry.  Detailed information on PubChem can be found in these three papers:


        As of February 2017, PubChem contains more than 235 million depositor-provided substances, 94 million unique chemical structures, and one million biological assays, which cover about 10 thousand protein target sequences.  For efficient use of this vast amount of data, PubChem provides various search and analysis tools.  Some of these search tools will be used later in this course for demonstration purposes.


2.2. ChemSpider: a chemical database integrated with RSC’s publishing process

        ChemSpider (,6 is a free chemical structure database, containing information on 34 million structures collected from ~500 data sources.  It also provides information on chemical reactions through ChemSpider SyntheticPages (CSSP)7.  ChemSpider uses a crowdsourcing approach that allows registered users for manual comment and correction of ChemSpider records.  Owned by the Royal Society of Chemistry (RSC), which publishes ~40 peer-reviewed chemistry journals, ChemSpider is integrated with the RSC publishing process, whereby new chemicals identified in newly published RSC articles also become available in ChemSpider.


2.3. ChEMBL: literature-extracted biological activity information

        ChEMBL (,9 is a large bioactivity database, developed and maintained by the European Bioinformatics Institute (EBI), which is part of the European Molecular Biology Laboratory (EMBL).  The core activity data in ChEMBL are “manually” extracted from the full text of peer-reviewed scientific publications in select chemistry journals, such as Journal of Medicinal Chemistry, Bioorganic Medicinal Chemistry Letters, and Journal of Natural products.  From each publication, details of the compounds tested, the assays performed and any target information for these assays are abstracted.  ChEMBL also integrates screening results and bioactivity data from other public databases (such as PubChem BioAssay) and information on approved drugs from the U.S. FDA Orange Book10 and the NLM’s DailyMed11.


2.4. ChEBI: a dictionary of small molecular entity

        ChEBI (,13 stands for “Chemical Entities of Biological Interest”.  It is a freely available database of “small” molecular entities, developed at the European Bioinformatics Institute (EBI).  The molecular entities in ChEBI are either natural or synthetic products used to intervene the processes of living organisms.  As a rule, however, ChEBI does not contain macromolecules directly encoded by genome (e.g., nucleic acids, proteins, and peptides derived from protein by cleavage).  ChEBI provides “standardized” descriptions of molecular entities that enable other databases to annotate their entries in a consistent fashion. ChEBI focuses on high-quality manual annotation, non-redundancy, and provision of a chemical ontology rather than full coverage of the vast chemical space.  Note that both ChEMBL and ChEBI are developed and maintained by the EMBL-EBI.  While ChEMBL focuses on “bioactivity” of a large number of bioactive molecules (currently ~2.0 millions), ChEBI is a “dictionary” that provides high-quality standardized descriptions for a relatively small number of molecules (currently ~50 thousands).


2.5. NIST Webbook: thermodynamic and spectroscopic data of chemicals

        The U.S. National Institutes of Standards and Technology (NIST) compiles chemical and physical property data for chemical species and distributes them through the web site called the NIST Chemistry WebBook (,15.  These data include thermochemical data (e.g., enthalpy of formation, heat capacity, and vapor pressure), reaction thermochemistry data (e.g., enthalpy of reaction and free energy of reaction), spectroscopic data (e.g., IR and UV/Vis spectra), gas chromatographic data, ion energetics data, and so on.


2.6. DrugBank: comprehensive information on drug molecules

        DrugBank16-18 ( is a comprehensive online database containing biochemical and pharmacological information about ~8,000 drug molecules, including U.S. Food and Drug Administration (FDA)-approved small-molecule drugs and biotech drugs (e.g., protein/peptide drugs) as well as experimental drugs.  DrugBank provides a wide range of drug information, including drug targets, mechanism of action, adverse drug reactions, food-drug and drug-drug interactions, experimental and theoretical ADMET properties (i.e., Absorption, Distribution, Metabolism, Excretion, and Toxicity), and many others.  Most of these data are curated from primary literature sources, by domain-specific experts and skilled biocurators.


2.7. HMDB: the Human Metabolome Database

        The Human Metabolome Database (HMDB) ( is comprehensive information on human metabolites and human metabolism data.  This database contains curated information derived from scientific literature, as well as experimentally determined metabolite concentrations in human tissue or biofluid (e.g., urine, blood, cerebrospinal fluid and so on).  Reference Mass spectra (MS) and nuclear magnetic resonance (NMR) spectra for metabolites are also provided when available.  In addition to data for “detected” metabolites (those with measured concentrations or experimental confirmation of their existence), the HMDB also provides information on “expected” metabolites (those for which biochemical pathways are known or human intake/exposure is frequent but the compound has yet to be detected in the body).


2.8. TOXNET: a collection of toxicological information

        TOXNET (, maintained by the National Library of Medicine (NLM) at NIH, is a group of databases covering toxicology, hazardous chemicals, toxic releases, environmental and occupational health, risk assessment.  Currently, 16 databases are integrated into the TOXNET system, and users can search all these databases either at once or individually.  While all the 16 databases provide valuable information, three of them may be worth mentioning in the context of this course.

  • ChemIDPlus26,27 is a dictionary of over 400,000 chemical records (names, synonyms, and structures) and provides access to the structure and nomenclature files used for the identification of chemical substances in the TOXNET system and other NLM databases. 
  • The Hazardous Substances Data Bank (HSDB)28,29 focuses on the toxicology of potentially hazardous chemicals, providing information on human exposure, industrial hygiene, emergency handling procedures, environmental fate, regulatory requirements, nanomaterials, and related areas. All HSDB data are referenced and derived from a core set of books, government documents, technical reports and selected primary journal literature. Importantly, HSDB is peer-reviewed by the Scientific Review Panel (SRP), a committee of experts in the major subject areas within the data bank's scope.
  • The Comparative Toxicogenomics Database (CTD)30,31  contains manually curated data describing interactions of chemicals with genes/proteins and diseases.  This database provides insight into the molecular mechanisms underlying variable susceptibility for environmentally influenced diseases.

A brief overview of TOXNET and its databases can be found in the TOXNET Fact Sheet23 and a recent paper by Fowler and Schnall25.


2.9. Protein Data Bank (PDB): a key source for protein-bound ligand structures

        The Protein Data Bank (PDB) is an archive of the experimentally determined 3-D structures of large biological molecules such as proteins and nucleic acids. These structures were determined primarily by using X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy.  While PDB is not a small molecule database, it contains the 3-D structures of many proteins with small-molecule ligands bound to them.  PDB allows users to search for proteins that an input small molecule binds to.  Considering that it is not possible to experimentally determine how small molecules (such as drug or toxic chemicals) actually bind to their target proteins in a living organism, PDB is the most widely used resource for experimentally determined protein-bound structures of small molecules.  The PDB are maintained by the Worldwide PDB (wwPDB)32, and freely accessible via the websites of its member organizations: PDBe (PDB in Europe)33,34, PDBj (PDB Japan)35,36, RCSB PDB (Research Collaboratory for Structural Bioinformatics PDB)37,38.


3. Data Organization in PubChem as a Data Aggregator

        PubChem ( is a data aggregator, meaning that it collects data from other data sources.  As of February 2017, PubChem’s data are from more than 500 organizations, including government agencies, university labs, pharmaceutical companies, substance vendors, and other databases.  An up-to-date list of PubChem’s data sources is available at the PubChem Sources page (  To better understand the features of this page, read this article on PubChem Blog:

        PubChem organizes its data into three inter-linked databases: Substance, Compound, and BioAssay (See Table 1), which can be searched from either the PubChem home page ( or the web page of one of the three PubChem databases.


Table 1. Three inter-linked databases in PubChem.











Individual data contributors deposit information on chemical substances to the Substance database (  Different data contributors may provide information on the same molecule, hence the same chemical structure may appear multiple times in the Substance database.  To provide a non-redundant view, chemical structures in the Substance database are normalized through a process called “standardization” and the unique chemical structures are identified and stored in the Compound database (  The difference between the Substance and Compound databases is explained in more detail in this blog post.

Descriptions of biological experiments on chemical substances are stored in the BioAssay database (  The unique identifiers used to locate records in these three databases are called SID (Substance ID), CID (Compound ID), and AID (Assay ID) for the Substance, Compound, and BioAssay databases, respectively.

        All information in the Substance database is submitted by individual data depositors.  However, the Compound database does contain information that are not submitted by data depositors, but annotated by the PubChem team.  [In the context of scientific databases, annotation refers to the process of adding extra information to a database entry (for example, a compound in the Compound database and an assay in the BioAssay database)].  The annotated information is always presented with its provenance information (that is, the source of the information).  The list of all the annotation sources used in PubChem is available at the PubChem Sources page (  From this page, one may download all the annotations from a particular source.


4. Special notes on using public chemical databases.

        All the databases mentioned above (including PubChem) are public databases that provide their contents free of charge, and in many cases they also provide a way to download data in bulk and integrate them into one’s own database.  Therefore, it is very common that database groups exchange their information with each other.  This often raises some technical concerns.  For example, as mentioned in Part 3 of Module 2, different databases may use different chemical representations to refer to the same molecule.  This may result in incorrect chemical structure matching between the databases, leading to incorrect data integration.  In addition, when one database has incorrect information, this error often propagates into other databases.  The error propagation issue is a serious, but very common, problem.39,40  Therefore, when using information in these databases, one should keep in mind various data accuracy and quality issues prevalent in these databases.  A goal of this course is to help students develop the ability to critically assess chemical information available in public databases.



(1)        Ram, S.; Liu, J. In SWPM'09 Proceedings of the First International Conference on Semantic Web in Provenance Management Washington, D.C. , 2009; Vol. 526, p 35.

(2)        Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L. Y.; He, J. E.; He, S. Q.; Shoemaker, B. A.; Wang, J. Y.; Yu, B.; Zhang, J.; Bryant, S. H. Nucleic Acids Res. 2016, 44, D1202.

(3)        Wang, Y.; Bryant, S. H.; Cheng, T.; Wang, J.; Gindulyte, A.; Shoemaker, B. A.; Thiessen, P. A.; He, S.; Zhang, J. Nucleic Acids Res. 2017, 45, D955.

(4)        Kim, S. Expert Opinion on Drug Discovery 2016, 11, 843.

(5)        ChemSpider ( (Accessed on 2/17/2017).

(6)        Pence, H. E.; Williams, A. J. Chem. Educ. 2010, 87, 1123.

(7)        ChemSpider SyntheticPages (CSSP) ( (Accessed on 2/17/2017).

(8)        ChEMBL ( (Accessed on 2/17/2017).

(9)        Gaulton, A.; Hersey, A.; Nowotka, M.; Bento, A. P.; Chambers, J.; Mendez, D.; Mutowo, P.; Atkinson, F.; Bellis, L. J.; Cibrián-Uhalte, E.; Davies, M.; Dedman, N.; Karlsson, A.; Magariños, M. P.; Overington, J. P.; Papadatos, G.; Smit, I.; Leach, A. R. Nucleic Acids Res. 2017, 45, D945.

(10)      Orange Book: Approved Drug Products with Therapeutic Equivalence Evaluations ( (Accessed on 2/17/2017).

(11)      DailyMed ( (Accessed on 2/17/2017).

(12)      ChEBI ( (Accessed on 2/17/2017).

(13)      Hastings, J.; de Matos, P.; Dekker, A.; Ennis, M.; Harsha, B.; Kale, N.; Muthukrishnan, V.; Owen, G.; Turner, S.; Williams, M.; Steinbeck, C. Nucleic Acids Res. 2013, 41, D456.

(14)      NIST Chemistry Webbook ( (Accessed on 2/19/2017).

(15)      Linstrom, P. J.; Mallard, W. G. J. Chem. Eng. Data 2001, 46, 1059.

(16)      DrugBank ( (Accessed on 2/19/2017).

(17)      About DrugBank ( (Accessed on 2/19/2017).

(18)      Law, V.; Knox, C.; Djoumbou, Y.; Jewison, T.; Guo, A. C.; Liu, Y. F.; Maciejewski, A.; Arndt, D.; Wilson, M.; Neveu, V.; Tang, A.; Gabriel, G.; Ly, C.; Adamjee, S.; Dame, Z. T.; Han, B. S.; Zhou, Y.; Wishart, D. S. Nucleic Acids Res. 2014, 42, D1091.

(19)      The Human Metabolome Database (HMDB) ( (Accessed on 2/19/2017).

(20)      About the Human Metabolome Database (HMDB) ( (Accessed on 2/19/2017).

(21)      Wishart, D. S.; Jewison, T.; Guo, A. C.; Wilson, M.; Knox, C.; Liu, Y. F.; Djoumbou, Y.; Mandal, R.; Aziat, F.; Dong, E.; Bouatra, S.; Sinelnikov, I.; Arndt, D.; Xia, J. G.; Liu, P.; Yallou, F.; Bjorndahl, T.; Perez-Pineiro, R.; Eisner, R.; Allen, F.; Neveu, V.; Greiner, R.; Scalbert, A. Nucleic Acids Res. 2013, 41, D801.

(22)      ToxNet ( (Accessed on 2/19/2017).

(23)      Factsheet - Toxicology Data Network (TOXNET) ( (Accessed on 2/19/2017).

(24)      Wexler, P. Toxicology 2001, 157, 3.

(25)      Fowler, S.; Schnall, J. G. Am. J. Nurs. 2014, 114, 61.

(26)      ChemIDplus ( (Accessed on 2/19/2017).

(27)      Fact Sheet - ChemIDplus ( (Accessed on 2/19/2017).

(28)      Hazardous Substances Data Bank (HSDB) ( (Accessed on 2/19/2017).

(29)      Fact Sheet - Hazardous Substances Data Bank (HSDB) ( (Accessed on 2/19/2017).

(30)      Comparative Toxicogenomics Database (CTD) ( (Accessed on 2/19/2017).

(31)      Fact Sheet - Comparative Toxicogenomics Database (CTD) ( (Accessed on 2/19/2017).

(32)      Worldwide Protein Data Bank (wwPDB) ( (Accessed on 2/19/2017).

(33)      Protein Data Bank in Europe (PDBe) ( (Accessed on 2/19/2017).

(34)      Gutmanas, A.; Alhroub, Y.; Battle, G. M.; Berrisford, J. M.; Bochet, E.; Conroy, M. J.; Dana, J. M.; Montecelo, M. A. F.; van Ginkel, G.; Gore, S. P.; Haslam, P.; Hendrickx, P. M. S.; Hirshberg, M.; Lagerstedt, I.; Mir, S.; Mukhopadhyay, A.; Oldfield, T. J.; Patwardhan, A.; Rinaldi, L.; Sahni, G.; Sanz-Garcia, E.; Sen, S.; Slowley, R. A.; Velankar, S.; Wainwright, M. E.; Kleywegt, G. J. Nucleic Acids Res. 2014, 42, D285.

(35)      Protein Data Bank Japan (PDBj) ( (Accessed on 2/19/2017).

(36)      Kinjo, A. R.; Suzuki, H.; Yamashita, R.; Ikegawa, Y.; Kudou, T.; Igarashi, R.; Kengaku, Y.; Cho, H.; Standley, D. M.; Nakagawa, A.; Nakamura, H. Nucleic Acids Res. 2012, 40, D453.

(37)      RCSB Protein Data Bank (RCSB PDB) ( (Accessed on 2/19/2017).

(38)      Rose, P. W.; Prlic, A.; Bi, C. X.; Bluhm, W. F.; Christie, C. H.; Dutta, S.; Green, R. K.; Goodsell, D. S.; Westbrook, J. D.; Woo, J.; Young, J.; Zardecki, C.; Berman, H. M.; Bourne, P. E.; Burley, S. K. Nucleic Acids Res. 2015, 43, D345.

(39)      Schnoes, A. M.; Brown, S. D.; Dodevski, I.; Babbitt, P. C. PLoS Comput. Biol. 2009, 5, e1000605.

(40)      Philippi, S.; Kohler, J. Nat. Rev. Genet. 2006, 7, 482.






PART I. Getting familiar with PubChem


1. Read these articles and answer the following questions.


(a) Explain the difference between the PubChem Substance and Compound databases in two or three sentences.



(b) Explain what the Compound Summary page of a compound is.



(c) Explain what the Substance Record page of a substance is.



(d) Explain the reason why the “legacy” designation was introduced in PubChem in two or three sentences.



(e) Among the menus available on the top of the PubChem home page ( is “Today’s Statistics”.  The number of compounds/substances/assays shown under this menu does not include “non-live” records.  What does “non-live” mean here?




2. While the PubChem Substance database is an archive in nature, data providers often want to update their substance information archived in PubChem.  For this reason, PubChem keeps all different “versions” of a substance record and shows the most recent version on its Substance Record page by default (Click here to read about what the Substance Record page is).  Go to the PubChem home page ( and follow the steps described below.


(a) After selecting the “Compound” tab above the search box, type “60823” in the search box and click the “Go” button.  This will direct you to the Compound Summary page for CID 60823 (atorvastatin). (Click here to read about what the Compound Summary page is.) [You will learn how to search PubChem in much more detail for next two modules (Modules 5 and 6).]



(b) Scroll down until you see “Contents” on the left column.  Expand this table of contents by clicking the “+” sign before “Contents”.  Locate the “Related Substances” section and click the record count for the “Same” item under that section.



(c) The previous step will lead you to the web page that presents 132 substance records associated with CID 60823 (atorvastatin).  This page is called a Document Summary (DocSum) page, because it presents a (very brief) summary of retrieved records.  Sort the list by SID (in ascending order) and click the substance that appear on the top of the DocSum page (that is, SID 9052).



(d) Clicking SID 9052 directs you to the Substance Record page for SID 9052.  From the Table of Contents, locate the “Modify Date” section.  Now you see a table that shows a list of the dates when this substance record was modified.


(e) Click each version of this record and fill in the table below with the information under the “Depositor Comments” and “Cross-References” sections.

Version 1









Version 2











Version 3











Version 4











Version 5











Version 6


Same as: D07474










(f) The “Depositor Comments” section of version 6 of this record has a link to another substance record in PubChem.  What is the SID of this substance record?  What is the CID of the compound associated with this substance record?  Is this CID the same as the one used in (a)?



(g) Briefly summarize (in three or four sentences) how depositor-provided information on SID 9052 has evolved over time.



3. Some records in PubChem are “non-live”, meaning that they are “not searchable”, although they do exist in the database.  This exercise is designed to help students better understand what non-live records are.


(a) In the previous exercise, you searched the PubChem Compound database for CID “60823”.  Now search the Compound database using the query “73336290”.  What message do you get?



(b) Repeat the search using the query “60823” (which we already know returns a hit) to go to the Compound Summary page for CID 60823.  What is the URL of this page (i.e., the web address)?




(c) Replace the string “60823” in the URL with “73336290” and press the “Enter” key on the keyboard.  This will get you to the Compound Summary page of CID 73336290, which you were not able to find in (a).  Indeed, this page presents the message “NOTE: NON-LIVE RECORD. See the related substances for more information.”  How many substances are associated with CID 73336290?  What are their SIDs?



(d) Go to the Substance Record page for the substance associated with CID 73336290 (by clicking the SID listed under the “Related Substances” section of its Compound Summary page).  What is the version number of this substance record?



(e) Go to the Substance Record page for the most recent version of the substance record in (d).  This page shows the message: “NOTE: REVOKED RECORD. See the revoke reason and the revision history for more information.”  What is listed as the revoke reason for this record?




PART II. Finding information on caffeine


4. Go to the PubChem home page ( After selecting the “Compound” tab above the search box, type “caffeine[CompleteSynonym]” in the search box and click the “Go” button.  It will lead you to the Compound Summary page for caffeine (CID 2519), which presents all information available in PubChem for this chemical.  [The suffix “[CompleteSynonym]” means that you want to search for compounds whose name exactly matches the query string (caffeine).  You will learn how to search PubChem in much more detail for next two modules (Modules 5 and 6).]


(a) Scroll down until you see “Contents” on the left column.  Expand this table of contents by clicking the “+” sign before “Contents”.  Locate the “boiling point” and “melting point” sections (under “Experimental Properties” of “Chemical and Physical Properties”).  Fill the table below using the information presented in these sections.  Because the data in these sections are given in either ºC or ºF, you need to convert from one unit to the other to fill the table.


Primary Data source

Boiling points in ºC

Boiling points in ºF







CAMEO Chemicals


352° F at 760 mm Hg (sublimes) (NTP, 1992)


Data source

Melting points in ºC

Melting points in ºF




DrugBank (Phys Prop)






CAMEO Chemicals

237.8 °C

460° F (NTP, 1992)

Human Metabolome Database




(b) The reported boiling points of caffeine are actually the sublimation temperature, as indicated in parentheses.  Explain the difference between boiling and sublimation. 



(c) Suggest a reason why PubChem reports the sublimation temperature under the boiling point section?



(d) The sublimation temperature and melting point of caffeine summarized in the table does not seem reasonable if you compare them with each other.  Explain why they are not reasonable.





(e) Below is the schematic phase diagram of caffeine.  Based on this diagram, what kind of information do you need to resolve the data accuracy issue identified in (d).


(f) Among the data sources in the table, how many sources explicitly reported the atmospheric pressure at which the phase transition temperatures were measured?  What are they?



(g) Whenever possible, PubChem presents data with its provenance (source).  For each boiling and melting point value, click the source name to find more detailed information about the data provenance.  In most case, PubChem provides a link to the data reported on the original data source.  Click this link to check the melting and boiling point data on the data sources’ websites.  Does any data source provide additional information about the boiling and melting points of caffeine that may help you figure out the issue with the phase transition temperatures of caffeine?



(h) Provide a short paragraph (less than 100 words) that evaluates the accuracy of the phase transition temperatures in PubChem and other databases, based on your answers to questions (a) ~ (g).


(i) The term “curation” refers to the process of critically reviewing the data as well as checking its accuracy in order to provide the data and related information for the scientific community.  What you did while answering the questions (a)~(h) may be considered as a part of manual curation efforts, and the paragraph written in (h) is essentially a curated review of the phase transition temperature data of caffeine.  Provide the total (estimated) time you spent for answering questions (a) ~ (h), and use it to estimate the amount of time necessary to manually curate the melting and boiling data for 10, 100, 1000, 10000 compounds, respectively.

(j) Curation can be done either manually or in an automated way.  Discuss (in less than 50 words) the pros and cons of manual and automated curations and when each approach would be appropriate.



PART III. Understanding toxicity of caffeine


5. Go back to the PubChem Compound Summary page for caffeine (CID 2519).  Locate the “GHS classification” section (under “Hazard Identification” of “Safety & Hazards”).  This section shows the GHS (Globally Harmonized System of Classification and Labelling of Chemicals) information of caffeine collected from authoritative organizations.  (Click here to read more about the GHS information.) 


(a) By default, the “GHS classification” section shows the GHS information from a single organization. To view information from all organizations, click the “View GHS classification from all sources” at the end of the section.  Based on the information on this page, fill in the table below.



EU Regulation & Austrailia

Japan NITE








GHS hazard statement code











(b) What does each pictogram in the table mean?  Refer to the PubChem GHS classification page (



(c) What does each hazard statement code means?  Refer to the PubChem GHS classification page (



(d) According to the information collected in the table, caffeine has acute oral toxicity.  However, from our experience, we know that the consumption of caffeine in a cup of coffee is usually okay.  To better understand the GHS information on caffeine, it is necessary to review the oral toxicity data of caffeine.  First fill in this table with the oral LD50 values for different animal species compiled in the “Non-Human Toxicity values” sections (under “Toxicological Information” of “Toxicity”).


Oral LD50 value


192 mg/kg







Guinea pig






(e) Fill in the table with the caffeine content in typical food and drinks, which can be found in the “Food Survey Values” section (under “Ecological Information” of “Toxicity”).  To compare these values with the oral LD50 values, divide the caffeine content by the average bodyweight of adult humans (assumed to 60 kg).  Also calculate the per cent value of the caffeine content per kg. relative to the oral LD50 value for mice [that is, caffeine content per kg divided by LD50 (mice)].



Caffeine content

Caffeine content per kg


A single cup of espresso (30 mL)

64 mg

1.1 mg/kg

0.84 %

A 8-oz cup of automatic drip coffee




A cup of tea (camellia sinensis)




Soft drinks per 12 oz




Energy drinks per 12 oz






(f) Find information on human toxicity of caffeine from the “Human Toxicity Values” section (under “Toxicological Information” of “Toxicity”).  Compare this information with the data collected in (d) and (e).



(g) Read the following articles about caffeine.



(h) Based on the information collected from (a) through (g), write a short paragraph (less than 100 words) about the toxicity of caffeine and caffeine-containing products.


Join the conversation.

Comments 15

Bob Hanson's picture
Bob Hanson | Thu, 03/02/2017 - 12:11

Students --  now that you are getting acquainted with how to get chemical information, I have a question for discussion:

What is the pKa of aniline?

Provide us your source as a link. Tell us how that number makes chemical sense.


Sunghwan Kim | Thu, 03/02/2017 - 22:17

I think this is another interesting example.  I think it is worth compiling all kinds of non-sense cases that we find in PubChem (or via PubChem) during this course, giving explanations that make sense out of them, and sharing with the chemical education community.  It would be a very interesting work.

By the way, I am not going to tell you my answer for a while so that other people can enjoy this quiz, too.

Olcc S15 | Sun, 03/12/2017 - 17:36
In reading about Bioactivity data in PubChem, i came about a question about the relation between PubChem and ChEMBL. Mr Kim actually helped me understand the relation between the two. So from what i can get ,depending on the scientific question needed to be answered, we can determine if PubChem or ChEMBL is appropriate. .<a href=""></a>.

Olcc S10 | Mon, 03/13/2017 - 20:38
Hi everyone, I was having trouble on this module about what the difference between data provenance and substance records. Dr. Kim said that the substance records are not the data provenance, however they do contain the data provenance information. I hope this helps any that where confused like I was. Link for hypothesis (<a href=""></a>) Emily

OLCC S197 | Mon, 03/13/2017 - 22:20
HI ALL, initially i asked a question on toxnet databases and i thank hychem 95 for his or her answer .in case you want to know more on the TOXNET databases just use <a href=""></a> (nwume)

OLCC S197 | Mon, 03/13/2017 - 22:20
HI ALL, initially i asked a question on toxnet databases and i thank hychem 95 for his or her answer .in case you want to know more on the TOXNET databases just use <a href=""></a> (nwume)

Olcc S14 | Tue, 03/14/2017 - 08:43
I asked last week about Tanimoto equation and hychem95 helped me to understand that equation and sent me this link. <a href=""></a> You can download this paper from this link and if You want to view the question and answer, use this link. <a href=""></a>

Bob Belford's picture
Bob Belford | Thu, 03/16/2017 - 14:32

I am having some problems using the Identifier Exchange to convert CIDs to canonical smiles.

For input I use CIDs, and have a text file of the CIDs, which I will try and upload with this comment. 

For operator type I have tried "Same, Sterochemistry" and "Same, Connectivity", thinking that would generate isomeric and canonical respectively.

For output method I used "Two column file showing input-output correspondence".

I am attaching the results that I saved to an Excel sheet. This makes no sense to me.

I am also attaching a linked Excel sheet that makes it easier to go to the PubChem pages.

I am also adding the txt file that I uploaded. Does anyone know what I am doing wrong?


Sunghwan Kim | Thu, 03/16/2017 - 15:19

to make sense out of it, let me explain what is going on here for CID 1615 as an example. Please visit the Related Compounds section of the Compound Summary page of CID 1615.

In this section, you can see:

137 compounds with the same connectivity as CID 1615.

134 compounds with the same stereo(chemistry) as CID 1615.

Click these links in a new tab or window, then you will see these compounds with thumbnail images.  These docsum pages are not stored as Entrez history, so combine the results using "NOT", to get the compounds that have the same connectivity but not the same stereochemistry.  (That is, the query looks like:

(History number for the 137 compound set) NOT (History number for the 134 compound set).

(that is, it should be something like #1 NOT #2)

This will give you three compounds (CIDs 102208935, 854031, 667458).  In these compounds, the stereocenters are fully specified (either R or S).  Remember that the query compound CID 1615 that we begin with have an unspecified stereo centers.

So, if you go back to the advanced search builder to take the intersection of the 137 and 134 compound sets, using "AND"

(History number for the 137 compound set) AND (History number fo the 134 compound set).

(That is, something like #1 AND #2)

Now you get 134 compounds that have the same connectivity and stereochemistry (with an unspecified stereocenter).  Just open the five records on this DocSum page in new tabs or windows.  You will see that these are all the same except that atoms at different positions are rare isotopes (like deterium (2H) or C-13).  Basically, all 134 compounds are MDMA with different isotopic compositions, with varying positions of rare isotopes.  This kind of cases occur quite often, when you search for very commonly studied molecules.

With that said, in the excel file you attached, you see many record pairs that involves CID 1615.  For these record pairs, the returned isomeric SMILES strings contain many "2H"'s, meaning that they contain deuteriums at different positions.  [If you selected CIDs as an output (instead of SMILES), you could've got the CIDs in the 137 or 134 compound set that I used in the example above.]

Considering that what you wanted was to compare SMILES strings between stereoisomers, you need to select the "Same, Isotope" option for the operator type.  This will return compounds with the same isotopes as your query (but not necessarily the same stereochemistry).  If you select "SMILES" as an output option, you will see what you expected to see.

Jordi Cuadros's picture
Jordi Cuadros | Thu, 03/16/2017 - 17:09

My understanding is that, if you want just to convert from one ID to another, you should use the "Same CID" operator. Best, Jordi

Sunghwan Kim | Thu, 03/16/2017 - 18:38

I think what he wanted to do was to use the ID exchange service to get the isomeric SMILES of stereoisomers of a set of compounds and check the difference between them.

OLCC S69 | Wed, 03/29/2017 - 10:52
As I was going through the assignment for this module I went through the section in Part 1 where we are instructed to change the URL for the search of compound 60823 to find the non-live compound 733692. How was the CID for the non-live record determined. How would one know to search a certain number to find a non-live record, or is it only used if one had previously used that CID and it later became non-live. Then, the person could alter the URL to go back and find that compound they had previously searched for?

Sunghwan Kim | Wed, 03/29/2017 - 13:49

Detailed information on non-live records can be found at one of the papers listed at the beginning of Question 1.

But the bottom line is that you *can't* find non-live records because, by definition, they are not searchable. However, it is still accessible if you know "where the record is".

Probably, somewhat confusing to understand, so let me explain it this way. Essentially, a URL, which stands for "Uniform Resource Locator", is the address of the record you are interested in. When you modify the address of CID 60823 to that of 733692, it means that you already know where CID 733692 is. (You just told your web browser the address of CID 733692. This is not the same as searching for the record.) With that said,......

>>> How would one know to search a certain number to find a non-live record?

Again, non-live records are not searchable. You should know the existence of a particular non-live record and what its web address is.

>>> Is it only used if one had previously used that CID and it later became non-live.

All records in PubChem Compound (whether live or non-live) entered into PubChem as live records because at least one depositor provided information on associated substances. Each compound record is assigned to its own address. To access that record, you would (metaphorically) search *an address book* for that record. However, once you know where the record is, you can get that record even if it is dropped out of the address book (that is, not searchable but still accessible).

OLCC S66 | Wed, 03/29/2017 - 22:38
I've been looking through several different sources and may be overlooking the answer, but I was wondering what the time frame is for a depositor to be considered inactive and for their records to take on the legacy designation? Also, when it comes to removing the legacy designation, is that a manual process of the depositor making contact with PubChem? Or is that in some way semi-manual and semi-automatic as well? For example, if a depositor (previously considered inactive) again adds information to PubChem, would that automatically make the other records lose the legacy designation, or would that depositor have to make contact with PubChem prior to adding a new record? Basically, I'm slightly confused about the order, but I'm not sure exactly how to phrase the question.

Sunghwan Kim | Thu, 03/30/2017 - 09:46

Currently, both adding and removing a legacy designation are being done manually.


It is true that we have *automated* check if legacy depositors submit new data into PubChem. However, the final decision whether to remove a legacy designation is manually done.


[It is expected to be (and really is) extremely rare for a legacy depositor to start submitting new data because it stopped data submission for a reason (mostly money-related issues, e.g., discontinued funding or product, strategic priority changes, business gone under, etc.) and this reason is not likely to be changed by PubChem's legacy designation.]