Thursday, March 20, 2014

It is time for RData files to become the standard for Data Transfer


It is time for RData files to become the primary means of disseminating publicly available data online.

1. R is the most efficient statistical software at compressing data

I was recently attempting to download weather data from the US government and found myself stymied because the dataset I wanted was considered too large (5+ GB).  The problem, I realized, was not that I wanted too much data but that the transfer format was so inefficient.  The only format available was CSV, so I was forced to drop many variables and resubmit the data request.

Ultimately, I downloaded only some of the pieces of the data, which ended up being a file 627.7 MB in size.  Importing the data into R via the read.csv command and immediately saving it as an RData file reduced the size to 55.3 MB (a 92% reduction).  As a point of comparison, I imported the data into Stata 12 and saved it in Stata's native format, which resulted in a reduced size of 318.2 MB (a 49% reduction).  I also compared zipped versions of the CSV, Stata, and R files.  The zipped R file made only trivial gains, at 54.3 MB, while the compressed Stata file made considerable gains, taking up only 79.5 MB.  The zipped CSV file still performed the worst, taking up 120.4 MB.
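
For anyone who wants to reproduce this kind of conversion, the R side takes only a couple of lines (a minimal sketch; the file names here are hypothetical):

    weather <- read.csv("weather.csv")         # import the raw CSV
    save(weather, file = "weather.RData")      # write a compressed RData file
    file.info("weather.csv", "weather.RData")$size / 2^20  # on-disk sizes in MB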


Format                   Native Size   Read Time   Zipped
Comma Separated Values   627.7 MB      -           120.4 MB
R (RData)                55.3 MB       1.12        54.3 MB
Stata (dta)              318.2 MB      1.24        79.5 MB

I also timed how long it takes to read this data into R versus Stata and found that the difference in read times was not substantial.  However, this should not be assumed to be the case for all systems, since I am running a solid state drive, which has much higher read speeds than a traditional magnetic hard drive.
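
Timing the comparison on your own machine is straightforward (again with hypothetical file names):

    system.time(read.csv("weather.csv"))   # read time for the CSV
    system.time(load("weather.RData"))     # read time for the RData file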

Looking at other software, I recently downloaded another dataset that was provided in three different formats: Borland Database Format (DBF), 130 MB; Microsoft Access Database (MDB), 110 MB; and SPSS/PASW (SAV), 45 MB.  After importing the data into R and saving it as an RData file, my resulting file took up only 3.2 MB.

Format                            Size     Zipped
Borland Database Format (DBF)     130 MB   4.5 MB
Microsoft Access Database (MDB)   110 MB   7.2 MB
SPSS/PASW (SAV)                   45 MB    4.8 MB
R (RData)                         3.2 MB   3.1 MB

This efficiency alone makes a strong case for saving and distributing data as RData files whenever possible.

2. R's code is open source

This may not sound like a big deal, but the open source nature of R makes it extremely easy to transfer data from R into any other program.  Quick-R gives sample code that can be used to easily save data to (or read data from) SPSS, SAS, or Stata.  In addition to providing an easy means of transferring data between statistical programs, R does not face issues relating to lack of backwards compatibility.
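
For example, the foreign package that ships with R covers the common proprietary formats.  A minimal sketch, in which mydata is a hypothetical data frame:

    library(foreign)
    write.dta(mydata, "mydata.dta")       # save to Stata's format
    from_stata <- read.dta("mydata.dta")  # read from a Stata file
    from_spss <- read.spss("survey.sav", to.data.frame = TRUE)  # read from SPSS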

Stata, for instance, only allows datasets to be saved in formats readable by the previous three or four versions.  Thus if you are running Stata 12, 11, or 10, you can only save datasets so that they are compatible with Stata 8 or later (see the Stata help topic).  This practice on Stata's part seems unnecessarily harsh, since it in effect forces users to upgrade their version of Stata if only to access data saved by users who can afford later versions.

I suspect that this is not primarily an issue with Stata but one relating to proprietary software in general.  Proprietary software companies would like to encourage, whether gently or heavy-handedly, the purchase of newer software, even if it is at the expense of current users.

This issue does not exist with R.  Thus, you can be assured that by saving data in R, anybody will be able to access it.

3.  R Projects are Easily Bundled

Different types of data files allow for different levels of embedded descriptive information, such as Stata's variable labels.  As far as I know, R has the most extensive options available for bundling information into a single file.  Not only can R save data and descriptive labels in a single bundle, but functions specific to the data may be included in the same bundle as well.  For example, if you are working with health data, you may want not only the BMI and other health indexes for each individual but also the functions that calculate those indexes.  Including these functions is simple within an RData file, as the sketch below shows.
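
A minimal sketch of that bundling (the object names patients and bmi are hypothetical):

    # A small health dataset and the function that goes with it:
    patients <- data.frame(weight_kg = c(70, 85), height_m = c(1.75, 1.80))
    bmi <- function(weight_kg, height_m) weight_kg / height_m^2
    save(patients, bmi, file = "health.RData")  # both travel in one file
    # A recipient loads the bundle and applies the included function:
    load("health.RData")
    patients$bmi <- bmi(patients$weight_kg, patients$height_m)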

It may not be clear why this is a large advantage, because nearly all statistical software packages, as far as I know, can produce external script files, and R has this option as well.  The advantage of R is that it also allows complex or unique functions to be embedded within the RData file itself.

So R, Now What?

If you accept that R is an ideal candidate for use as a standard for sharing statistical data, due to R's superior data compression, its open source code, and its ability to easily bundle information into a single file, there is still a bit of a problem posed by the R workspace system.

As far as I know, there are no standards for organizing data transferred between R users.  Thus even though transfers are highly efficient, it is not clear how data should be organized within an R workspace.  This is in contrast to Stata, whose data files have a standard spreadsheet structure with added information in the form of variable labels and value labels (R's factors).

The easiest solution to this problem would be to include some kind of standard documentation, such as a readme function, in any RData file released.  This function would display a list of the objects in the RData file and describe their components (see the sketch below).  Further refinements to such a standard might include establishing common names for simple datasets, such as naming the default data "mydata".
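
A minimal sketch of the proposed convention (all object names here are hypothetical):

    mydata <- data.frame(id = 1:3, bmi = c(22.1, 27.4, 19.8))
    readme <- function() {
      cat("Objects in this file:\n",
          " mydata - example health data\n",
          "   id:  subject identifier\n",
          "   bmi: body mass index (kg/m^2)\n")
    }
    save(mydata, readme, file = "mydata.RData")  # data and documentation together
    # A recipient loads the file and types readme() to see what it contains.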

14 comments:

  1. I am just transferring some files from MATLAB format to R, so I can give one more data point for your data-compression dataset:

    The example is a dataset of numeric values, size 589904 x 7.
    CSV: 32.07 MB
    Matlab: 14.63 MB, read time 260 ms
    RData: 7.58 MB, read time 320 ms

    The idea of including a "readme" object in RData files is really useful! One can even include an info() function that gives you the main information (variable labels etc) about the dataset(s) just by writing info().

    1. That'd be reinventing the comment() function. All this is usually dealt with by a package architecture. I use GitHub repos with README files to achieve similar results.
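
      (For reference, base R's comment() attaches a hidden note to any object; mydata is a hypothetical data frame:)

        comment(mydata) <- "Weather observations, 2013; see codebook for details"
        comment(mydata)  # prints the attached note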

    2. Oh, I didn't know that comment() was invented! I have to start using that.

  2. I am concerned that by transferring data with the .RData extension, the data may become unusable in other programs until it has been opened in R and written out in another format. Compressing data is a good idea, and it is exactly what R does when you call the base::save function. But a text file compressed with gzip, for example, can be read into SAS without first being unzipped in R and re-saved with a more universally accepted extension.
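
    (R can itself produce such a gzipped text file; mydata is a hypothetical data frame:)

      # Write a CSV through a gzip connection; many programs read .csv.gz:
      write.csv(mydata, gzfile("mydata.csv.gz"), row.names = FALSE)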

    1. A reasonable comment, of course. I would suggest providing data in two formats: CSV and RData. However, if RData files become the norm, then the responsibility for compatibility between systems will shift to proprietary software providers, who will need to support RData files. Since the source code for reading these files is open, they will have few excuses for not complying with the standard.

    2. File formats are often dictated by legacy code. French official stats often come in old formats because the backend is coded in SAS or something like that. If you want to change the data standard, you have to provide these legacy routines in R language and hope that they get picked up as quickly as possible.

  3. Now now now, everyone knows the official standard data format is Excel.xlsx :-( . I'll just point out you've left out ENVI, gzip, tar, .idt, TIFF, JPG, and about a zillion other file formats. The difficulty in getting .RData accepted is not only unpacking the objects (which could include closures as well as data arrays and structures) but writing interpreters for the objects. That's a big job -- tho' I was happy to see that Mathematica released an interpreter last year.

  4. Thanks for this. I agree wholeheartedly. My industry tends to use CSV and sas7bdat (SAS) data files. I tend to find that the sas7bdat files are actually larger(!) than CSV files. I really like the readme/info idea with the RData files and will start doing that more often.

  5. One potential problem with your proposal: I generated a 61 gigabyte database last week in the course of running some Monte Carlo simulations. I'm pretty happy that I chose to save it as an SQLite file instead of an RData file (although SQLite doesn't support concurrent writes, which is a pain). Do you know of any ways to incrementally load or save to RData files?
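
    (The incremental SQLite workflow the commenter describes looks roughly like this, assuming the RSQLite package; all names are hypothetical:)

      library(RSQLite)
      con <- dbConnect(SQLite(), "simulations.db")
      dbWriteTable(con, "results", new_batch, append = TRUE)  # append each batch
      dbDisconnect(con)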

  6. README <- scan("readme.txt", what = "character", sep = "\n")
    now you have a README within the .RData

  7. Why not use HDF?

    1. Here is a test for one data.table using the 'rhdf5' package:
      .RData: 10 MB, 2.8 sec to write, 0.6 sec to read
      .h5: 21 MB, 25 sec to write, 2.8 sec to read

      at least in this case, .RData wins hands down
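
      (A rough sketch of such a benchmark, assuming Bioconductor's rhdf5 package and a hypothetical data.table dt:)

        library(rhdf5)
        system.time(save(dt, file = "dt.RData"))  # native R serialization
        system.time(load("dt.RData"))
        h5createFile("dt.h5")
        system.time(h5write(dt, "dt.h5", "dt"))   # HDF5 via rhdf5
        system.time(h5read("dt.h5", "dt"))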

  8. Exactly. Why use RData, which nothing other than R really reads, instead of HDF5, which everything under the sun can read?

  9. Has anyone compared to SAS xpt files?
