2.3 Common Computer Files and Formats

The mention of binary files in the previous sections highlights that computers store data in different ways.  Traditionally, programs or applications are stored in binary format because they contain the code that allows the program to run, and thus should not be displayed as text (it's not readable anyway).  In addition, many non-text files (images, audio, video, etc.) need to store their content in an efficient format, one where a text encoding such as UTF-8 (or ASCII) is not needed.  Historically, though, the most important reason for storing applications and files as binary was that it gave developers a way to protect their work and make money from it.  If an application stores its files in a proprietary format (one that only the application can read and write), then no one else can work with those files without licensing the software used to create them.

In the early years, before the Internet, this model worked well.  But once we all became connected and more and more armchair software developers started developing free software, a significant pushback against this model began.  People wanted to be able to share documents without anyone having to pay high licensing fees just to read a letter sent out to relatives.  Although a lot of progress has been made in making file formats open and based on community standards, there are still a lot of issues with proprietary file formats, especially in the sciences.  If you think about it, much of the data you get off an instrument is stored in a proprietary format; if you want to get the data into Microsoft Excel, for instance, you have to export it.  In a lot of cases you are not getting all the information that was gathered, but more about that later.

Microsoft is a good example of a company that used a proprietary binary file format for saving documents from all its applications and has since transitioned to an open-standard, text-based format.  In 2000 Microsoft started producing versions of its files (starting with Excel) in what was to become Office Open XML (OOXML), a standard released by ECMA in December 2006 (see http://www.ecma-international.org/publications/standards/Ecma-376.htm).  The eXtensible Markup Language (XML) is a text-based markup language published by the World Wide Web Consortium (http://www.w3.org/XML/) that is used to annotate text strings using tags.  Hypertext Markup Language (HTML), used in web browsers, is a closely related tag-based markup language primarily used for presentation.
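To make the idea of annotating text with tags concrete, here is a minimal sketch that parses a small XML record using Python's standard library. The record itself, including the tag names `compound`, `name`, `formula`, and `mass`, is invented for illustration and is not taken from OOXML or any other standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tag and attribute names are made up for illustration.
XML_SNIPPET = """<compound>
  <name>water</name>
  <formula>H2O</formula>
  <mass unit="g/mol">18.015</mass>
</compound>"""


def describe(xml_text):
    """Return a (tag, attributes, text) tuple for each child of the root element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, dict(child.attrib), child.text) for child in root]


if __name__ == "__main__":
    # Each tag says what the text inside it means.
    for tag, attrs, text in describe(XML_SNIPPET):
        print(tag, attrs, text)
```

The tags carry the meaning of each text string, which is why the file stays readable in any text editor, at the cost of extra characters.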

Comparing the binary Word document format (.doc file) with the new OOXML format (.docx file) and plain text (.txt file), we can see the differences (these files are attached at the bottom of the page).

Figure: The Word .doc File Format (images: Word .doc file; Word .doc file properties)

Figure: The Word .docx File Format (images: Word .docx file native; Word .docx file native properties)

We see that the .doc file is unintelligible in a text reader and is over 22K characters long, which is large considering the file contains only the text ‘Chemical Informatics’.  The .docx file is also unreadable and slightly larger in size.  But wasn’t this supposed to be in the text-based OOXML format?  It is: the .docx file is a folder of XML files that has been ZIP-compressed into a single file.  If you change the extension of a .docx file to .zip and uncompress it, you see the folder of XML documents (which take up a lot of space, a common problem with XML).

Figure: Word .docx unzipped; Word .docx unzipped properties
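The same inspection can be done without renaming the file, since any ZIP reader will open a .docx. The sketch below uses Python's standard `zipfile` module to list each file in the package with its compressed and uncompressed size. The helper names are illustrative, and the demo package built in memory is only a stand-in so the example runs by itself; it is not a valid Word document.

```python
import io
import zipfile


def list_zip_parts(source):
    """Return (name, compressed_size, uncompressed_size) for each file in a ZIP archive.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return [(info.filename, info.compress_size, info.file_size)
                for info in archive.infolist()]


def make_demo_package():
    """Build a tiny in-memory stand-in for a .docx package (NOT a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Repetitive XML-like content, to show how well tag-heavy text compresses.
        archive.writestr("word/document.xml", "<w:t>Chemical Informatics</w:t>" * 50)
        archive.writestr("word/styles.xml", "<w:style/>" * 50)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # To inspect a real file, pass its path instead, e.g. "example.docx" (hypothetical name).
    for name, packed, unpacked in list_zip_parts(make_demo_package()):
        print(f"{name}: {packed} bytes compressed, {unpacked} bytes uncompressed")
```

Comparing the two size columns for a real .docx shows how much of the XML overhead the ZIP compression recovers.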

Finally, if you look inside the ‘document.xml’ file you find the text content.  The other files in the folder store information about the fonts used, the Word settings, the styles in use, and an image of the file used by the operating system to display a thumbnail of the file content.  The ‘document.xml’ file contains the text of the document and other ‘markup’ (the XML) that provides the structure of the file, the layout of the page, and other important information.

Figure: Word .docx XML file; Word .docx XML file properties
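Because ‘document.xml’ is ordinary XML, the text can be pulled out with a few lines of code. The following is a minimal sketch, assuming the text sits in `w:t` elements inside `w:p` paragraph elements in `word/document.xml`, which is how WordprocessingML normally stores it. The function names are illustrative, and the in-memory demo file holds only `word/document.xml`, so Word itself may not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace for WordprocessingML elements such as <w:p> (paragraph) and <w:t> (text).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(source):
    """Return the text of each paragraph in word/document.xml as a list of strings."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text may be split across several <w:t> elements.
        pieces = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return paragraphs


def make_demo_docx():
    """Build a minimal in-memory stand-in containing only word/document.xml."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Chemical </w:t></w:r><w:r><w:t>Informatics</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(docx_text(make_demo_docx()))
```

This ignores formatting, tables, headers, and footnotes; it is meant only to show that the content is reachable with general-purpose tools rather than the original application.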

Of course, if you just need the text (i.e., no formatting, etc.) then you could use the .txt format.  Its size, 21 bytes, is the same as the length of the text (including the carriage return at the end of the sentence).  Sometimes less is more…

Figure: Text file; Text file properties
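The 21-byte figure can be checked with a short calculation. This sketch assumes, as described above, that the file holds the text ‘Chemical Informatics’ followed by a single carriage return; an editor that writes a two-character line ending (carriage return plus line feed) would produce 22 bytes instead.

```python
TEXT = "Chemical Informatics"


def size_in_bytes(text, encoding="utf-8"):
    """Return the number of bytes needed to store `text` in the given encoding."""
    return len(text.encode(encoding))


if __name__ == "__main__":
    print(len(TEXT))                            # 20 characters
    print(size_in_bytes(TEXT + "\r"))           # 21 bytes in UTF-8
    print(size_in_bytes(TEXT + "\r", "ascii"))  # 21 bytes in ASCII as well
```

For plain English text like this, UTF-8 and ASCII give the same result because each character takes exactly one byte.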


Comments 4

Brandon Davis (not verified) | Thu, 09/10/2015 - 16:37
I thought that the XML document format was supposed to be smaller than the older Office files. I saved two files that each said "cheminformatics" and the .docx one was about half the size of the .doc. Is there a setting in Word that adds extra compression?

Stuart Chalk | Thu, 09/10/2015 - 16:48
Did you mean XML > .doc? If that's the case then yes, XML files tend to be large because of all the tags in them. Microsoft realized that, so because of the size and the need for multiple XML files to represent a Word document, they zipped the folder of files and changed the extension to .docx. The difference in size for .docx files varies, though, because image files stored inside the folder do not compress much, as they are normally already compressed. So, you will see the biggest size difference for 'text only' Word documents. If I have misinterpreted your question let me know...

Brandon Davis (not verified) | Sat, 09/12/2015 - 12:38
I saved a file in 2 formats like in the article, and the .docx file was 12 KB and the .doc was 22 KB. I was curious what factors contribute to the discrepancy between my files and those described in the article.

Stuart Chalk | Sat, 09/12/2015 - 16:59
It probably depends on things like the default font you have set up, the styles that are in use, and whether or not the .docx is compatible with the older versions of Word. It probably also depends on the OS, the size of the hard drive (large drives are chunked into larger pieces), and the version of Word you are using.
