1.10. Special notes on data exchange/integration

All the databases mentioned above are public databases that provide their contents free of charge, and in many cases they also provide a way to download the data in bulk and integrate it into one's own database.  As a result, database groups very commonly exchange information with each other.  This often raises technical concerns.  For example, as mentioned in Module 5, different databases may use different chemical representations to refer to the same molecule.  This can cause chemical structures to be matched incorrectly between databases, leading to incorrect data integration.  In addition, when one database contains incorrect information, the error often propagates into other databases.
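The structure-matching problem can be illustrated with a minimal Python sketch. The two "databases" below are hypothetical dictionaries made up for illustration, not real exports. Both store ethanol, but under different (equally valid) SMILES strings, so a naive comparison of the stored strings finds no overlap, while comparing a standardized identifier (here the standard InChI for ethanol) does.

```python
# Hypothetical records: each database stores ethanol under a different,
# equally valid SMILES string.
ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

db_a = {"CCO": {"name": "ethanol", "inchi": ETHANOL_INCHI}}
db_b = {"OCC": {"name": "ethyl alcohol", "inchi": ETHANOL_INCHI}}

# Naive integration: match records whose stored SMILES strings are identical.
smiles_matches = set(db_a) & set(db_b)

# Integration on a standardized identifier: match records by InChI instead.
inchi_a = {record["inchi"] for record in db_a.values()}
inchi_b = {record["inchi"] for record in db_b.values()}
inchi_matches = inchi_a & inchi_b

if __name__ == "__main__":
    print("Matched by SMILES string:", len(smiles_matches))
    print("Matched by InChI:", len(inchi_matches))
```

In practice the standardized identifier would be computed from each structure with a cheminformatics toolkit rather than stored by hand as it is in this sketch.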

Comments 4

John Turner (not verified) | Mon, 10/19/2015 - 18:15
I see you talk about ways that they allow downloading the data in bulk, but do any of them offer APIs so that you don't need to download bulk data?

Jordi Cuadros | Mon, 10/19/2015 - 23:43
Discussion on APIs will arrive in Module 8. We'll get there.

Sunghwan Kim | Mon, 10/19/2015 - 23:49
Many databases provide programmatic access routes. For example, PubChem provides several, including (1) E-Utilities, (2) Power User Gateway (PUG), (3) PUG-SOAP, and (4) PUG-REST. Of these, PUG-REST is the simplest and easiest to learn. PUG-REST encodes all the information needed to access a particular piece of data in PubChem into a one-line Uniform Resource Locator (URL). For example, the following URL retrieves the records for Compound IDs (CIDs) 2244 and 1983 from PubChem in XML format: <a href="http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244,1983/record/XML">http://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244,1983/record/XML</a>

An introduction to programmatic access to PubChem is available at the following link: <a href="http://nar.oxfordjournals.org/content/43/W1/W605">http://nar.oxfordjournals.org/content/43/W1/W605</a>

Alternatively, you can download a set of records that you are interested in using the PubChem Structure Download and Assay Download tools:
<a href="https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi">https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi</a> (for structure download)
<a href="https://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi">https://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi</a> (for assay data download)

The structure download tool allows users to download up to 500,000 compounds. To download more than this, you can split your compound list into chunks of at most 500,000 compounds each. However, if you have a very long list of compounds to download, you should bulk download from the PubChem FTP site (<a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/">ftp://ftp.ncbi.nlm.nih.gov/pubchem/</a>).
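The PUG-REST URL pattern and the list-chunking advice above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not PubChem's own client code: the function names `build_record_url`, `chunked`, and `fetch_records` are hypothetical, and the base URL uses `https` where the example above uses `http`.

```python
from urllib.request import urlopen

# Base URL follows the example URL in the comment above.
# Assumption: https is used here instead of the http shown in that example.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_record_url(cids, output="XML"):
    """Return the PUG-REST URL for the full records of the given CIDs."""
    cid_list = ",".join(str(cid) for cid in cids)
    return f"{PUG_REST_BASE}/compound/cid/{cid_list}/record/{output}"


def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_records(cids, output="XML", timeout=30):
    """Download the records for `cids` and return the raw response bytes."""
    with urlopen(build_record_url(cids, output), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Prints the same path as the example URL above (with https).
    print(build_record_url([2244, 1983]))
```

The same `chunked` helper can split a long compound list into pieces of 500,000 for the Structure Download tool. For requests that put the CIDs in the URL itself, much smaller chunks would be needed, since URLs have practical length limits.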

Sunghwan Kim | Thu, 10/22/2015 - 18:05
Oxford University Press publishes a journal called “Nucleic Acids Research”. Every January, this journal publishes a special “Database” issue that covers a wide variety of databases useful to its readers. It also maintains and updates a list of important databases so that people can see which databases are available to the scientific community. This database collection is available at: <a href="http://nar.oxfordjournals.org/content/39/suppl_1/D1/suppl/DC1">http://nar.oxfordjournals.org/content/39/suppl_1/D1/suppl/DC1</a>.

Several Database Issue papers covering the databases mentioned in this class have been published in this journal as advance articles. They are:
<a href="https://doi.org/10.1093/nar/gkv951">https://doi.org/10.1093/nar/gkv951</a> (PubChem)
<a href="https://doi.org/10.1093/nar/gkv1031">https://doi.org/10.1093/nar/gkv1031</a> (ChEBI)
<a href="https://doi.org/10.1093/nar/gkv1047">https://doi.org/10.1093/nar/gkv1047</a> (PDBe)

The following papers also deal with popular chemistry databases. I suggest that you skim the titles and abstracts of these papers to see what kind of information is available to the public.
<a href="https://doi.org/10.1093/nar/gkv1072">https://doi.org/10.1093/nar/gkv1072</a> (BindingDB)
<a href="https://doi.org/10.1093/nar/gkv1075">https://doi.org/10.1093/nar/gkv1075</a> (SIDER)
<a href="https://doi.org/10.1093/nar/gkv1037">https://doi.org/10.1093/nar/gkv1037</a> (The guide to pharmacology)