Discussion

Alex Williams (not verified) | Mon, 09/07/2015 - 23:00
I never knew that these sites published APIs; this kind of data would be great for creating one's own pathway finder for synthesis.
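For example, PubChem exposes compound data over plain HTTP through its PUG REST API. A minimal Python sketch follows; the compound name and the particular properties requested are just illustrative choices:

```python
import json
import urllib.parse
import urllib.request

def fetch_properties(name):
    """Look up basic properties for a compound by name via PubChem PUG REST."""
    url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
           f"{urllib.parse.quote(name)}/property/"
           "MolecularFormula,MolecularWeight/JSON")
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # PUG REST wraps results as PropertyTable -> Properties (a list of records)
    return data["PropertyTable"]["Properties"][0]

print(fetch_properties("aspirin"))
# e.g. {'CID': 2244, 'MolecularFormula': 'C9H8O4', 'MolecularWeight': ...}
```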

Leah Rae McEwen | Mon, 09/07/2015 - 15:57
It might be helpful in this conversation to distinguish a bit between the processes associated with evaluating the credibility of what is usually meant by "literature" (e.g. journal articles, book chapters, the written word) and of data sets (often numeric, direct from research or compiled). Research data and written articles are related types of information in scientific research and not always completely separable, but they have tended to be distinguished in their paths to initial publication, how they are made discoverable by indexing services and available in libraries, how they are evaluated before and after publication, and how they are used. As Kristin suggests, these lines are blurring somewhat with new modes of handling data and publishing in the digital environment. However, it is still useful for everyday research, and also helpful when evaluating sources, to be aware of these distinctions. For example, Module 1 focuses primarily on searching, evaluating and organizing the written word literature; how to manage data sets is considered further in Module 3 and in other lectures throughout the course.

I posted separately on evaluating the secondary literature databases discussed in this Module. The "data" in some of these sources consist primarily of citations and abstracts of articles, as in the Web of Science; other databases, such as Reaxys, reprint numeric and process data from articles and other sources. Older databases, such as Chemical Abstracts, have traditionally been indexed by human professionals, in part as another layer of review. More recent databases may be extracted by algorithm, such as PubChem, which reprints data from other compilations "as is" with full disclosure of the source, for the user to consider directly in their own evaluation process. As Bob mentions, databases may not pick up on the types of human-sourced errors highlighted in "Retraction Watch", a blog that focuses primarily on data published within articles. This is another vote for researchers to review data, and the rationale published with it (whether in a more traditional analysis or a "data paper"), before using it in their own work.

A few other approaches to data evaluation particular to chemistry are worth mentioning briefly, as they will not be covered otherwise: crystallographic data and materials property data. The crystallographic research community has formulated over time a process for peer review of these data through a robust standard file format that enables automated validation checks as well as human expert review. The data are concurrently published in a sustainable repository that supports both open retrieval of individual data sets through the original publication and subscription-based analysis software (Cambridge Crystallographic Data Centre, http://www.ccdc.cam.ac.uk/pages/Home.aspx).

The National Institute of Standards and Technology employs rigorous data evaluation strategies for various types of materials properties reported from various sources, based on a binary assessment of acceptable vs. non-acceptable for re-use in industry applications. An early version of a decision tree for ceramic materials is the NIST Interactive Data Evaluation Assessment tool, which identifies several levels of evaluated data, including certified, validated, qualified, commercial, typical, research and unevaluated (IDELA, http://www.ceramics.nist.gov/IDELA/IDELA.htm). The more recent NIST ThermoData Engine employs similar categories to dynamically evaluate data based on large compilations of experimental data (http://trc.nist.gov/tde.html), the latest effort in a long history of systematically collecting and evaluating data for re-use.
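For those curious what an "automated validation check" might look like: crystallographic data are exchanged in the standard CIF (Crystallographic Information File) format, and real validation is done by services such as the IUCr's checkCIF. The Python sketch below is only a toy first-pass sanity check; the file name is hypothetical, while the tag names are standard CIF core dictionary items:

```python
# Toy illustration (nothing like the real checkCIF service): verify that
# a CIF file declares the basic unit-cell tags before deeper validation.
REQUIRED_TAGS = [
    "_cell_length_a", "_cell_length_b", "_cell_length_c",
    "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
]

def missing_cell_tags(cif_text):
    """Return the required unit-cell tags not present in the CIF text."""
    present = {line.split()[0] for line in cif_text.splitlines()
               if line.strip().startswith("_")}
    return [tag for tag in REQUIRED_TAGS if tag not in present]

with open("structure.cif") as f:  # hypothetical file name
    problems = missing_cell_tags(f.read())
print("Missing tags:", problems or "none")
```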

Leah Rae McEwen | Mon, 09/07/2015 - 13:11
Hi Judat, thank you for the comment. This presents a great opportunity to discuss evaluating information sources for research purposes. It is not enough to assume the credibility of a resource based solely on the fact that it exists. And with seemingly more information available through the Web than ever before, it becomes a critical part of the research workflow to discern the best-quality sources for the research question at hand. It can be helpful to look at a range of characteristics for evaluating the reliability of a source, including depth of content, currency, authority, accuracy and bias. All of these characteristics will vary among sources based in part on the purpose for which they were developed, how they are structured and maintained, and by whom, which can usually be determined from the About pages. A list of rubrics for assessing information sources is available at http://railsontrack.info/rubrics.aspx?catid=6 . This quick chart from McHenry County College is fairly representative and easy to use: http://www.mchenry.edu/library/tutorial/pdf/EvaluatingSourcesRubric.pdf .

With increasing access to the Internet, easier tools and devices for using it, and lower barriers to publicly posting information, more sources are available, but across a greater range of potential quality. It is not enough to assume the first free source is the best. The decision of what is most appropriate to use ultimately remains with the researcher, hopefully as an informed decision.

Considering a few of the specific databases mentioned: Chemical Abstracts (a.k.a. SciFinder) has a goal of being a comprehensive source of information on characterized compounds to support competitive R&D in the chemical industry, among other uses. This database indexes daily a broad range of publication types, from articles to conference proceedings to patents, in chemistry and related sciences, in more or less depth depending on relevance. The information is abstracted and indexed by degreed chemists employed by the largest chemical scientific society, based on chemical structures and fairly detailed terminology, and subscribing is a substantial cost for institutions. PubMed provides similarly broad coverage of the biomedical research literature, with a robust, medically focused index maintained by professional staff scientists and librarians at a national library, and the cost is subsidized for public use by the federal government. Both of these sources would pass most criteria for the evaluation characteristics mentioned above, although only one is subscription based.
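To make the rubric idea concrete, here is a minimal sketch of such a checklist as Python code. The 1-3 scale and the passing threshold are illustrative choices, not part of any published rubric:

```python
# Minimal sketch of a source-evaluation rubric using the five
# characteristics mentioned above; scale and threshold are illustrative.
CRITERIA = ["depth of content", "currency", "authority", "accuracy", "bias"]

def evaluate_source(scores):
    """scores: dict mapping each criterion to 1 (poor) .. 3 (strong)."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"unscored criteria: {missing}")
    total = sum(scores[c] for c in CRITERIA)
    return total, "likely reliable" if total >= 12 else "use with caution"

print(evaluate_source({
    "depth of content": 3, "currency": 2, "authority": 3,
    "accuracy": 3, "bias": 2,
}))  # -> (13, 'likely reliable')
```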

Robert Belford | Mon, 09/07/2015 - 09:33
This is an interesting "easy-read" article for your holiday, which I do not think you need a subscription to access: http://www.nature.com/news/pioneer-behind-controversial-pubpeer-site-reveals-his-identity-1.18261?WT.mc_id=TWT_NatureNews Cheers, Bob

Dr. Briney | Sun, 09/06/2015 - 14:54
There is definitely a lot of work being done to make more data openly available and to increase the quality of that data at the same time. Beyond just posting data when you publish an article - something that funding agencies are now strongly encouraging - there is a new mode of publishing called a "data paper". This is where you publish a paper *about a dataset* (not about an analysis of the dataset), and both the data paper AND the data are then peer reviewed and published. The Nature journal "Scientific Data" is one of the big players in this area, but other publishers, like PLOS, also publish data papers. Data peer review is still a new topic and generally involves: reviewing the dataset for consistency/errors, checking that the documentation completely and logically describes the dataset, and evaluating the importance/relevance of the data. Between data sharing, data papers, data mining, and other new types of scholarship, we're going to be seeing many changes to the way we research and publish going forward!
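A quick sketch of the kind of automated consistency checks a data reviewer might run before the human review steps described above; the file name and the column expectation are hypothetical:

```python
# First-pass consistency report for a submitted tabular dataset.
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical submitted dataset

report = {
    "rows": len(df),
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_values": df.isna().sum().to_dict(),
    # Example domain check: temperatures should be physically plausible.
    "bad_temperature_K": int((df["temperature_K"] <= 0).sum())
        if "temperature_K" in df.columns else "column absent",
}
for check, result in report.items():
    print(f"{check}: {result}")
```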

John House (not verified) | Fri, 09/04/2015 - 21:24
I am a little curious as to how much weight "gray" bibliographic entries carry in the peer-reviewed arena.

Brandon Davis (not verified) | Fri, 09/04/2015 - 18:45

Science blogs are an important "gray" resource. Formerly, the primary ways scientists could communicate with the public were journal publications or talks. Because science reporting in the popular media is generally abysmal, blogs by working scientists are valuable as platforms for discussion and criticism of current science topics in the news. This is a good example: https://www.sciencebasedmedicine.org . We can't be experts in every field, so it's useful to have knowledgeable experts evaluate technical articles for the rest of us.
Is anyone familiar with any chemistry-centric blogs?

John House (not verified) | Fri, 09/04/2015 - 17:38
Excellent point. It leads me to ask just how effectively scientific literature can be vetted for accuracy when access is restricted to private libraries. More needs to be done to push the scientific community toward open access to information.

Justin M. Shorb | Fri, 09/04/2015 - 15:26
Often, I find that the easiest way to "Open" files is to drag them onto Zotero or onto Chrome (or, in some cases, Firefox). I do use a Mac, though. I have often found that Zotero has already installed the MS Word extension without me needing to download anything (it comes with the Standalone version). You can also install the MS Word plugin through the Zotero Standalone "Preferences" or "Settings" panel (on a Mac, it's Preferences -> "Cite"). If it says "Reinstall Word Plugin", then you've already got it installed! -Justin

Robert Belford | Fri, 09/04/2015 - 15:24

Hi All, I was actually hoping there might be some initial discussion on this post, and I think this is a topic that will come up multiple times this semester: how do we evaluate data? I know some very reputable scientists who will only publish data if it is open. In fact, one could argue that in this evolving world of Big Data, the data need to be open. There are also national funding agencies that require taxpayer-funded work to be open. Also, it is my understanding that many subscription-based services actually mine the primary literature, which adds a layer of potential error, even before considering issues like those tracked by Retraction Watch, http://retractionwatch.com/. I am really curious what people with actual knowledge and experience in these issues think, and I hope this topic will come up in many of the modules as we progress through the semester. Thank you for bringing up this topic.
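As one concrete (and hedged) illustration of checking retraction status programmatically: Crossref's REST API lets you ask whether any notice has been deposited as an update to a given DOI. A minimal Python sketch, assuming Crossref's documented `updates` filter; the DOI below is hypothetical, and a result here should always be confirmed against the publisher and Retraction Watch:

```python
# Query Crossref for records (e.g. retraction or correction notices)
# deposited as updates to a given DOI.
import json
import urllib.request

def updates_for(doi):
    """Return (type, DOI) pairs for any update notices on file at Crossref."""
    url = f"https://api.crossref.org/works?filter=updates:{doi}"
    with urllib.request.urlopen(url) as resp:
        items = json.load(resp)["message"]["items"]
    return [(it.get("type"), it.get("DOI")) for it in items]

# Hypothetical usage: an empty list means no update notice is on file.
print(updates_for("10.1000/example.doi"))
```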