Thursday, December 1, 2016

Efficiently Saving and Sharing Data in R

After spending a day the other week struggling to make sense of a federal data set shared in an archaic format (ASCII fixed format dat file).

It is essential for the effective distribution and sharing of data that it use the minimum amount of disk space and be rapidly accessible for use by potential users.

In this post I test four different file formats available to R users. These formats are comma separated values csv (write.csv()), object representation format as a ASCII txt (dput()), a serialized R object (saveRDS()), and a Stata file (write.dta() from the foreign package). For reference, rds files seem to be identical to Rdata files except that they deal with only one object rather than potentially multiple.

In order to get an idea of how and where different formats outperformed each other I simulated a dataset composed of different common data formats. These formats were the following:

Numeric Formats

• Index 1 to N - ex. 1,2,3,4,...
• Whole Numbers - ex. 30, 81, 73, 5, ...
• Big Numbers - ex. 36374.989943, 15280.050850, 5.908210, 79.890601, 2.857904, ...
• Continous Numbers - ex. 1.1681155, 1.6963295 0.8964436, -0.5227753, ...

Text Formats

• String coded factor variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd ...
• String coded factor variables with 16 characters coded as strings
• String coded factor variables with 64 characters coded as strings
• Factor coded variables with 4 characters - ex. fdsg, jfkd, jfht, ejft, jfkd - coded as 1,2,4,3,2, ...
• Factor coded variables with 16 characters
• Factor coded variables with 64 characters
• String variables with random 4 characters - ex. jdhd, jdjj, ienz, lsdk, ...
• String variables with random 16 characters
• String variables with random 64 characters

What type of format a variable is in is a predictive characteristic for how much space that variable takes up and therefore how time consuming that variable is to read or write. For variables that are easy to describe they tend to take up little space. An index variable in an extreme example and can take up almost no space as it can be expressed in an extremely compact format (1:N).

In contrast numbers which are very long or have a great degree of precision tend to have more information and therefore take more resources to access and store. String variables when filled with truly random or unique responses are some of the hardest data to compress as each value may be sampled from the full character spectrum. There is some significant potential for compression when strings are repeated in the variable. These repetitive entries can be either coded as a "factor" variable or a string variable in R.

As part of this exploration, I look at how string data is stored and saved when coded as either a string or as a factor within R.

Raw Files

Let's first look at space taken when saving uncompressed files.

Figure 1: File Size

Figure 1 shows the file size of each of the saved variables when 10,000 observations are generated. The dataframe object is the data.frame composed off all of the variables. From the height of the dataframe, we can see that rds is overall the winner. Looking at the other variable values we can see only that csv appear to consistently underperform for most file formats except for random strings.

Figure 2: File Size Log scaled

In Figure 2 we can see that rds is consistently outperforming all of the other formats with the one exception of index in which the txt encoding simply reads 1:10000. Apparently even serializing to bytes can't beat that.

Interestingly, there does not appear to be a effective size difference between repetitive strings encoded as factors accounting for the size of the strings (4, 16, or 64). We can see that the inability of csv to compress factor strings dramatically penalizes the efficiency of csv relative to the other formats.

File Compression

But data is rarely shared in uncompressed formats. How does compression change things?

Figure 3: Zipped File Sizes Logged

We can see from Figure 3 that if we zip our data after saving, the file size can do pretty much as well as rds. Comma delineated csv files are a bit of an exception with factor variables suffering under csv. Yet random strings perform slightly better under csv than other formats. Interesting rds files seem slightly larger than the other two file types. Overall though, it is pretty hard to see any significant difference in file size based on format after zipping.

So, should we stick with whatever format we prefer?

Not so fast. Sure, all of the files are similarly sized after zipping. This is useful for sharing files. But having to keep large file sizes on a hard drive is not ideal even if they can be compressed for distribution. There is finite space on any system and some files can be in the hundreds of MB to hundreds of GB range. Dealing with file formats and multiple file versions which are this large can easily drain the permanent storage capacity of most systems.

But an equally important concern, is how long it takes write and read different file formats.

In order to test reading speeds, I loaded each of the different full dataframe files fifty times. I also tested how long it would take to unzip then load that file.

Figure 4: Reading and unzipping average speeds

From Figure 4, we can see that read speeds roughly correspond with the size of files. We can see that even a relatively small file (30 MB csv file) can take as long as 7 seconds to open. Working with large files saved in an inefficient format can be very frustrating.

In contrast, saving files in efficient formats can dramatically cut down on the time taken opening those files. Using the most efficient format (rds), files could be 100 times larger than those used in this simulation and still open in less than a minute.

Conclusions

Finding common file formats that any software can access is not easy. As a result many public data sets are provided in archaic formats which are poorly suited for end users.

This results in a wide pool of software sweets having the ability to access these datasets. However, with inefficient file formats comes a higher demand on the hardware of end users. I am unlikely to be the only person struggling with opening some of these large "public access" datasets.

Those maintaining these datasets will argue that sticking with the standard, inefficient format is the best of bad options. However, there is no reason they could not post datasets in rds formats in addition to the outdated formats they currently exist in.

And no we need not argue that selecting one software language to save data in will be biased toward those languages. Already many federal databases come with code supplements in Stata, SAS, or SPSS. To access these supplements, one is required to have paid access to that software.

Yet, R is free and its database format is public domain. Any user could download R, open a rds or Rdata file, then save that file in a format more suited to their purposes. None of these other proprietary database formats can boast the same.