Discussion | DivCHED CCCE: Cheminformatics OLCC

Statistical analysis

Should we perform some kind of statistical analysis to determine whether the ocs concentration is significantly different, or just note whether the means are higher or lower?
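One way to go beyond comparing means is a significance test. The sketch below is only an illustration, not the module's prescribed method: it runs an exact two-sided permutation test on the difference in means using the Python standard library. The function name `permutation_p_value` and the two sets of concentration values are hypothetical, made up for the example.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in means.

    Returns the fraction of all possible regroupings of the pooled values
    whose absolute mean difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    count = 0
    # Enumerate every way of choosing which pooled values form "group A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical concentrations (arbitrary units), invented for illustration.
site_a = [0.48, 0.52, 0.50, 0.49]
site_b = [0.61, 0.58, 0.63, 0.60]

p = permutation_p_value(site_a, site_b)
print(f"p = {p:.4f}")
```

With these made-up values the two groups do not overlap, so only 2 of the 70 possible regroupings are as extreme as the observed one and p is 2/70, about 0.029. Enumerating every regrouping is only practical for small samples; for larger ones a t-test or a randomly sampled permutation test would be the usual choice.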

I noticed that this module touched on versioning data files. In my research we rely heavily on version control on our servers to preserve not only the data, but also the system and the cloud storage used to maintain it. We mostly use git on Linux-based servers, and it seems to be widely used in the programming and computer science fields. In my opinion the benefits of such a system are endless: it can be automated, data can be encrypted and pushed to remote systems to take care of the 3-2-1 rule, and it takes up less storage because only the changes to a file or data set are saved in each new version. My question is: do you see a future in which all branches of science adopt these types of systems for storing their data?

Backup in industry

I work at a company that also uses cloud storage, and of course data integrity is a very important issue. Some big companies (including Google) store tapes so they have a backup in case of emergency. We are quite a small company and we don't do any on-site storage. Instead we use advanced replication technologies that copy our data to multiple data centers. This is also what companies like Google do with their data. If one copy of the data is lost, there are still a number of copies left that actively update each other and add new copies if required. Data can be inserted in any copy: in a master/slave system the data is routed to the master and then copied to all slaves, and if the master goes down a new one is elected; in a master-to-master system each copy can accept data and send it to the other members of the cluster. The PDB (Protein Data Bank) is another example of a very important academic dataset that is mirrored, in this case by separate databases. These technologies make cloud storage extremely reliable. Still, our company almost suffered serious data loss after firmware updates on our hardware, which is why using multiple data centers is so important.
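To make the master/slave idea concrete, here is a toy sketch in Python. It is not how any real replication product works: the `Cluster` class, the data center names, and the "election" rule (pick the first surviving node alphabetically) are all assumptions made for illustration.

```python
class Cluster:
    """Toy master/slave cluster; every node keeps a full copy of the data."""

    def __init__(self, names):
        # Hypothetical node names; each node is just an in-memory dict.
        self.nodes = {name: {} for name in names}
        self.master = names[0]

    def write(self, key, value):
        # Writes are routed to the master first, then copied to every slave.
        self.nodes[self.master][key] = value
        for name, store in self.nodes.items():
            if name != self.master:
                store[key] = value

    def fail(self, name):
        # Simulate losing one data center.
        del self.nodes[name]
        if not self.nodes:
            raise RuntimeError("all copies lost")
        if name == self.master:
            # Real systems run an election protocol; as a stand-in we
            # simply promote the first surviving node by name.
            self.master = sorted(self.nodes)[0]

    def read(self, key):
        return self.nodes[self.master][key]


cluster = Cluster(["dc-east", "dc-west", "dc-north"])
cluster.write("sample-001", "spectrum.csv")
cluster.fail("dc-east")                  # the master goes down
cluster.write("sample-002", "notes.txt")  # writes continue on the new master
print(cluster.master, cluster.read("sample-001"))
```

The data written before the failure is still readable afterward because every surviving node already held a copy, which is the property the post describes.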

Re: CC0 vs. Public Domain

There's basically no difference. CC0 is Creative Commons' way to waive your rights over content, effectively putting it into the public domain (stuff falls into the public domain naturally over time, and CC0 lets you speed up this process). It's also really useful for clearing up cases where rights like copyright may or may not exist. The one thing to note about CC0 and the public domain is that it's still ethically responsible to cite your sources, even if those sources are licensed under CC0 or in the public domain.

CC0 vs. Public Domain

Is there a functional difference in how one is allowed to use materials under a CC0 license and materials in the public domain?

Re: Data Management

I agree with Dr. Briney. I'd also like to add that research work in a specific type of industry is more homogeneous than exploratory research in academia, which makes standardization comparatively easier to carry out in industry. In addition, in both industry and academia, many existing practices around data organization and management may be sufficient for using data within a small research team, but they may not be good at all for the new context of sharing data and enabling (machine) reuse of data by a broader research community. Therefore, we are exploring these recommended practices at the basic level with broader data sharing and reuse in mind, in the hope that they will be useful for you no matter what kind of career path you take.

Re: Data security

I agree! I've seen WAY TOO MANY horror stories about students losing copies of their theses and don't want anyone else to fall into the trap. Do realize that off-site storage, particularly cloud storage, also isn't foolproof; companies fold, have technical issues, and lose data. That's why it's so important to keep both on-site (i.e., in your direct control) and off-site copies of your data.

Re: Data Management

Like most things, the answer is "it depends". While industry certainly has more leeway to mandate certain data management practices (you often see these around lab notebooks), that doesn't necessarily mean that practices are better in industry than in academia. The plain truth is that we're not yet teaching this stuff to researchers consistently, so most people cobble something together on their own, whether they work in academia or industry. The one exception is likely to be researchers with a lot of data or doing a lot of computation, because they're forced to think about some of this stuff (data consistency, file naming, etc.) to make their analysis run smoothly. So the answer is that it depends more on the scale of your data than on where you work.

Data security

I really wish that more people followed the 3-2-1 Rule. Store your data off-site!
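For anyone who wants to check their own setup, the 3-2-1 Rule (at least 3 copies, on at least 2 different storage media, with at least 1 copy off-site) can be written as a short test. This is just a sketch: the `Copy` record, the `satisfies_3_2_1` function, and the example storage locations are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    label: str     # where the copy lives (hypothetical names below)
    medium: str    # kind of storage, e.g. "ssd", "external-hdd", "cloud"
    offsite: bool  # True if the copy is outside your building


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 Rule."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


thesis_copies = [
    Copy("lab workstation", "ssd", offsite=False),
    Copy("external drive in desk", "external-hdd", offsite=False),
    Copy("cloud storage account", "cloud", offsite=True),
]

print(satisfies_3_2_1(thesis_copies))      # all three conditions hold
print(satisfies_3_2_1(thesis_copies[:2]))  # only two copies, none off-site
```

The first call prints `True` and the second prints `False`, since dropping the cloud copy leaves only two copies and nothing off-site.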

Data Management

Do you feel that poor data management occurs more frequently in academia than in industry, where there should be standard operating procedures to avoid it?