3.2 File Organization

File Organization

Good organization helps you find and sort through your data and makes the data easier to reuse later. Save yourself time by organizing your files as you create them.

The most important thing for organization is to have a system and use it consistently. This will help you track down files when you need them and not waste time combing through useless information.

Here are several options for organization:

  • By project
  • By analysis type
  • By date
  • By researcher
  • By thesis chapter
  • By site or data source

You can also use these systems in combination.

Whichever approach you choose, find a system that works for your data (it does not have to be one listed here) and stick to it. You should also document your organization system in your lab notebook or another prominent place.

Here are several examples of file organization systems:

Experimental data:

  • By experiment
    • By file type (raw data, analyzed data, figures, etc.)

Collaborative project data:

  • By researcher
    • By project
      • By date
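If you set up the same folder layout for every project, a short script can create it for you so the structure stays consistent. The following is a minimal Python sketch of the "experimental data" layout above. The function name, the folder names, and the experiment names are hypothetical placeholders, not part of any standard.

```python
from pathlib import Path
import tempfile


def build_experiment_tree(root, experiments,
                          file_types=("raw_data", "analyzed_data", "figures")):
    """Create root/<experiment>/<file_type>/ folders and return their paths."""
    created = []
    for experiment in experiments:
        for file_type in file_types:
            folder = Path(root) / experiment / file_type
            # parents=True builds missing levels; exist_ok=True makes reruns safe
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


# Demonstration in a throwaway directory (experiment names are made up)
with tempfile.TemporaryDirectory() as tmp:
    for folder in build_experiment_tree(tmp, ["KMnO4_kinetics", "HCl_titration"]):
        print(folder.relative_to(tmp))
```

Swapping the loop order or adding a level (for example researcher, then project, then date) gives the collaborative layout instead.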

 

Adapted from the data management guide by UWM Libraries (http://guides.library.uwm.edu/data), CC-BY.

 

File Naming

HORROR STORIES:

A file naming convention adds standardization to your files, making them much easier to organize and locate. It will also help your colleagues sort through your files should you fall ill or leave the lab. For this reason, document your naming scheme in your laboratory notebook (preferably at the front or back for easy access) or in another prominent place.

There are conventions available for you to choose from, though you will probably want to customize one for your own purposes. There are a few general tips for creating systems for naming files.

First, pick a group of files that you wish to name consistently and decide on the key information that will distinguish one file from another. Pick 2-3 things that will tell you a file's contents. Examples are:

  • Date
  • Site
  • Analysis
  • Sample
  • Short description

Once you pick your key pieces of information, arrange them into a pattern using the following rules:

  • Files should be named consistently
  • File names should be descriptive but short (fewer than 25 characters)
  • Use underscores instead of spaces
  • Avoid these characters: " / \ : * ? ' < > [ ] & $
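The rules above are simple enough to check automatically. Here is a minimal Python sketch that reports which rules a proposed file name breaks. The function and constant names are hypothetical, the check applies to the name without its extension, and reading the quotation marks in the list as the plain double and single quote characters is an assumption.

```python
# Characters to avoid, taken from the list above (quotes assumed to be " and ')
FORBIDDEN = set('"/\\:*?\'<>[]&$')
MAX_LENGTH = 25  # names should be shorter than this


def check_file_name(stem):
    """Return a list of rule violations for a file name without its extension."""
    problems = []
    if len(stem) >= MAX_LENGTH:
        problems.append(f"too long ({len(stem)} characters)")
    if " " in stem:
        problems.append("contains spaces; use underscores")
    bad = sorted(set(stem) & FORBIDDEN)
    if bad:
        problems.append("forbidden characters: " + " ".join(bad))
    return problems


print(check_file_name("20140422_PikeLake_03"))   # prints []
print(check_file_name("Pike Lake: sample 3?"))
```

An empty list means the name passes all three checks. Consistency across a group of files still has to be judged against your own documented pattern.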

You can also add version information, as necessary. Versioning can be eminently helpful when you are analyzing data. If you make a change to your data that you don’t want to keep, it’s simple to go back to an earlier version of the file. The same is true if a file gets corrupted or if you simply want to change your analysis method. The key to making versioning work is being consistent with version names, periodically saving to new versions, and documenting the differences between versions.

  • For analyzed data, use version numbers
  • Save files often to a new version
  • Label the final version FINAL
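As an illustration of consistent version names, the sketch below bumps a "_v" number at the end of a file name and swaps it for FINAL when the work is done. It assumes versions are written as an underscore, a lowercase "v", and digits at the end of the name (as in "KMnO4_FirstOrder_v2"); the function names are hypothetical.

```python
import re

# Assumed suffix style: <base>_v<digits>, e.g. KMnO4_FirstOrder_v2
VERSION_PATTERN = re.compile(r"^(?P<base>.+)_v(?P<number>\d+)$")


def next_version(stem):
    """Return the name with its version increased by one (or _v1 if unversioned)."""
    match = VERSION_PATTERN.match(stem)
    if match is None:
        return f"{stem}_v1"
    number = match.group("number")
    # zfill keeps any zero padding, so v09 becomes v10 rather than v010
    bumped = str(int(number) + 1).zfill(len(number))
    return f"{match.group('base')}_v{bumped}"


def mark_final(stem):
    """Replace the version suffix, if any, with FINAL."""
    match = VERSION_PATTERN.match(stem)
    base = match.group("base") if match else stem
    return f"{base}_FINAL"


print(next_version("KMnO4_FirstOrder_v2"))  # prints KMnO4_FirstOrder_v3
print(mark_final("HCl_ZeroOrder_v5"))       # prints HCl_ZeroOrder_FINAL
```

The script only generates names. You still need to record what changed between versions in your notebook or a README file.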

Using these guidelines, here are some example naming conventions and example file names. The first example, in particular, is useful for organizing PDFs of journal articles.

  • AuthorLastName-Year-Title
    • Smith-2010-ImpactOfStressOnSeaMonkeys
    • Hailey-1999-VeryImportantDNAStudy
  • YYYYMMDD_site_sampleNumber
    • 20140422_PikeLake_03
    • 20140424_EastLake_12
  • Experiment_Analysis_Version
    • KMnO4_FirstOrder_v2
    • HCl_ZeroOrder_v5
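Generating names from their parts, rather than typing them by hand, is one way to keep a convention consistent. The sketch below builds names for the first two patterns above. The helper names are hypothetical, and padding the sample number to two digits is an assumption based on the examples.

```python
from datetime import date


def camel_case(text):
    """Join words without spaces, capitalizing the first letter of each word."""
    return "".join(word[:1].upper() + word[1:] for word in text.split())


def sample_file_name(collected, site, sample_number):
    """Build a name following YYYYMMDD_site_sampleNumber."""
    return f"{collected:%Y%m%d}_{camel_case(site)}_{sample_number:02d}"


def article_file_name(author_last_name, year, title):
    """Build a name following AuthorLastName-Year-Title."""
    return f"{author_last_name}-{year}-{camel_case(title)}"


print(sample_file_name(date(2014, 4, 22), "Pike Lake", 3))
# prints 20140422_PikeLake_03
print(article_file_name("Smith", 2010, "Impact of stress on sea monkeys"))
# prints Smith-2010-ImpactOfStressOnSeaMonkeys
```

Titles that contain punctuation such as colons or question marks would need those characters removed first, which this sketch does not do.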

 

Adapted from “Starting Small: File Naming Conventions” by Kristin Briney (http://dataabinitio.com/?p=14/), CC-BY, and the data management guide by UWM Libraries (http://guides.library.uwm.edu/data), CC-BY.

 

Dates

The standard ISO 8601 is incredibly useful for data management. This standard concerns dates, a common type of information used for data and documentation. To understand why this standard is important, consider the following dates:

  • March 5, 2014
  • 2014-03-05
  • 3/5/14
  • 05/03/2014
  • 5 Mar 2014

All of these represent the same date but are expressed in different formats. The problem is that if someone uses all of these formats in her notes, how will you ever find everything that happened on March 5th? It’s simply too much work to search for all the possible variations. The answer to this problem is ISO 8601.

ISO 8601 dictates that all dates should use the format “YYYYMMDD” or “YYYY-MM-DD”. So the example above becomes “20140305” or “2014-03-05”. This provides you with a consistent format for all of your dates. Such consistency allows you to more easily find and organize your data, the hallmark of good data management.
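If you already have notes or file names with mixed date formats, you can convert them in bulk. The following Python sketch tries a list of known formats in turn and returns the ISO 8601 form. The format list is an assumption that covers only the five examples above: it treats a two-digit year as month-first and a four-digit year with slashes as day-first, and it expects English month names. You would need to adjust it to match how your own dates were actually written.

```python
from datetime import datetime

# Assumed formats, tried in order; ambiguous styles such as 3/5/14 are
# interpreted here as month/day/year. Change this list to suit your notes.
KNOWN_FORMATS = [
    "%B %d, %Y",   # March 5, 2014
    "%Y-%m-%d",    # 2014-03-05
    "%m/%d/%y",    # 3/5/14
    "%d/%m/%Y",    # 05/03/2014
    "%d %b %Y",    # 5 Mar 2014
]


def to_iso(text):
    """Return the date in YYYY-MM-DD form, or raise ValueError if unrecognized."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


for written in ["March 5, 2014", "2014-03-05", "3/5/14", "05/03/2014", "5 Mar 2014"]:
    print(written, "->", to_iso(written))  # each line ends with 2014-03-05
```

For the compact form without hyphens, `datetime.strptime(...).strftime("%Y%m%d")` gives "20140305".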

ISO 8601’s consistency is useful everywhere but especially at the beginning of file names. This is because dates written in this format sort chronologically: by year, then by month, and then by day. So if you date all of your file names using ISO 8601, you have an easy way to find and sort through information.
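A quick way to see this sorting behavior is to compare an ordinary alphabetical sort of ISO-dated names with the same files dated month-first. The file names below are illustrative only; the month-first versions are hypothetical counterparts made up for the comparison.

```python
# The same three files, dated two different ways
iso_names = [
    "20140424_EastLake_12",
    "20131105_PikeLake_07",
    "20140422_PikeLake_03",
]
month_first_names = [
    "04242014_EastLake_12",
    "11052013_PikeLake_07",
    "04222014_PikeLake_03",
]

# A plain alphabetical sort, like the one a file browser performs
print(sorted(iso_names))
# ['20131105_PikeLake_07', '20140422_PikeLake_03', '20140424_EastLake_12']

print(sorted(month_first_names))
# ['04222014_PikeLake_03', '04242014_EastLake_12', '11052013_PikeLake_07']
```

With ISO 8601 the alphabetical order is also the chronological order. With month-first dates the November 2013 file lands after the April 2014 files.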

Adapted from “Dating Your Data (or How I Learned to Stop Worrying and Love the Standard)” by Kristin Briney (http://dataabinitio.com/?p=449), CC-BY.

 

 


Comments 6

Brian Murphy | Tue, 09/22/2015 - 11:27
Do you feel that poor data management occurs more frequently in academia versus the industry where there should be standard operating procedures to avoid poor data management?

Dr. Briney | Tue, 09/22/2015 - 14:09
Like most things, the answer is "it depends". While industry certainly has more leeway to mandate certain data management practices (you often see these around lab notebooks), that doesn't necessarily mean that practices are better in industry than academia. The plain truth is that we're not yet teaching this stuff to researchers consistently, so most people cobble something together on their own, no matter if they work in academia or industry. The one exception is likely to be those researchers with a lot of data/doing a lot of computation because they're forced to think about some of this stuff (data consistency, file naming, etc.) to make their analysis run smoothly. So the answer is that it depends more on the scale of your data than where you work.

Ye Li | Tue, 09/22/2015 - 14:30
I agree with Dr. Briney. I'd also like to add that research work in a specific type of industry is more homogeneous than exploratory research work in academia, which makes standardization comparatively easier to carry out in industry. In addition, in both industry and academia, many existing practices around data organization and management may be sufficient for using those data within a small research team, but they may not be good at all for the new context of sharing data and enabling (machine) reuse of data for a broader research community. Therefore, we are exploring these recommended practices at the basic level with broader data sharing and reuse in mind, in the hope that they can be useful for you no matter what kind of career path you take.

OLCC s12 | Wed, 09/23/2015 - 23:30
I noticed in this module that you touched on versioning data files. In my research we rely heavily on version control with our servers to not only preserve data, but also the system and the cloud storage used to maintain this data. We mostly use git on Linux-based servers and this seems to be a widely used system for those in the programming and computer science fields. In my opinion the benefits of such a system are endless as it can be automated, data can be encrypted and pushed to remote systems to take care of the 3-2-1 rule, and it takes up less storage as only the changes to a file or data set are saved in each new version. My question is: do you see a future in which all branches of science may adopt these types of systems for storing their data?

Dr. Briney | Thu, 09/24/2015 - 09:21
That's great that you are using git for version control! There are definitely more scientists getting into this and similar systems, and I expect even more to adopt them in the future. However, I'm not sure we'll get to a point where every scientist will use a version control system like git, mainly because there is a lot of overhead to learn git and not everyone needs such a powerful system. For example, simple version control systems like labeling files with a version number - e.g. "v01", "v02", ..., "FINAL" - are easier to adopt and can still help manage information (even though, as you noted, this takes more space). So, overall, I think we're going to see a lot more scientists adopt tools like git but it won't be everyone.

OLCC s12 | Thu, 09/24/2015 - 15:03
I completely agree with your response and it did take me a while to learn to use git because of the learning curve as you mentioned.
