1. R is the most efficient statistical software at compressing data
I was recently attempting to download weather data from the US government and found myself stymied because the dataset I wanted was considered too large (over 5 GB). The problem, I realized, was not that I wanted too much data but that the transfer format was so inefficient: the only download format available was csv. I was therefore forced to drop many variables and resubmit the data request.
Ultimately, I downloaded only some of the pieces of the data, which ended up being a 627.7 MB file. Importing the data into R via the read.csv command and immediately saving it as an “RData” file reduced the size to 55.3 MB (a reduction of roughly 91%). As a point of comparison, I imported the data into Stata 12 and saved it in Stata’s native format, which resulted in a file of 318.2 MB (a 49% reduction). I also compared zipped versions of the csv, R, and Stata files. The zipped R file made only trivial gains, shrinking to 54.3 MB, while the compressed Stata file made considerable gains, taking up only 79.5 MB. The csv file compressed to zip still performed the worst, taking up 120.4 MB.
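A minimal sketch of the workflow just described (the file names here are illustrative):
weather <- read.csv("weather.csv")       # import the raw csv
save(weather, file = "weather.RData")    # save() gzip-compresses by default
# load("weather.RData") restores the object in a later session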
Format                 | Native Format Size | Read Time | Zipped Size
Comma Separated Values | 627.7 MB           | -         | 120.4 MB
R                      | 55.3 MB            | 1.12      | 54.3 MB
Stata                  | 318.2 MB           | 1.24      | 79.5 MB
I also clocked how long it took to read this data into R versus Stata and found that the difference in read times was not substantial. However, this should not be assumed to hold for all systems, since I am running a solid state drive, which has much higher read speeds than traditional magnetic hard drives.
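For reference, read times like these can be clocked in R with system.time() (file names again illustrative):
system.time(weather_csv <- read.csv("weather.csv"))   # time the csv import
system.time(load("weather.RData"))                    # time loading the RData file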
Looking at other software, I recently downloaded another dataset that was provided in three different formats: Borland Database Format (DBF) at 130 MB, Microsoft Access Database (MDB) at 110 MB, and SPSS/PASW (SAV) at 45 MB. After importing the data into R and saving it as an Rdata file, the resulting file took up only 3.2 MB.
Format                          | Size   | Zipped Size
Borland Database Format (DBF)   | 130 MB | 4.5 MB
Microsoft Access Database (MDB) | 110 MB | 7.2 MB
SPSS/PASW (SAV)                 | 45 MB  | 4.8 MB
R (Rdata)                       | 3.2 MB | 3.1 MB
This efficiency alone presents a strong case for saving and distributing data as Rdata files whenever possible.
2. R's code is open access
This may not sound like a big deal, but the open source nature of R makes it extremely easy to transfer data from R into any other program. Quick-R gives sample code that can be used to easily save (or read) data to (or from) SPSS, SAS, or Stata. In addition to providing an easy means of transferring data between statistical programs, R does not face issues relating to a lack of backwards compatibility.
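For example, a rough sketch along the lines of the Quick-R examples, using the foreign package (the object and file names here are illustrative):
library(foreign)
mydata <- data.frame(id = 1:3, income = c(25000, 41000, 38000))
write.dta(mydata, "mydata.dta")                                           # Stata format
write.foreign(mydata, "mydata_spss.txt", "mydata.sps", package = "SPSS")  # data plus SPSS syntax
write.foreign(mydata, "mydata_sas.txt", "mydata.sas", package = "SAS")    # data plus SAS syntax
stata_copy <- read.dta("mydata.dta")                                      # read the Stata file back in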
Stata, for instance, only allows data sets to be saved backwards for up to three or four previous versions. Thus if you are running Stata 12, 11, or 10, you can only save data sets so that they are compatible with users running Stata 8 or later (see the Stata help topic). This practice on Stata's part seems unnecessarily harsh, since it in effect forces users to upgrade their version of Stata if only to access data saved by users who can afford later versions.
I suspect that this is not primarily an issue with Stata but one relating to proprietary software in general. Proprietary software companies would like to encourage, whether gently or heavy-handedly, the purchase of newer software even if it comes at the expense of current users. This issue does not exist with R, however. Thus, you can be assured that by saving data in R, anybody will be able to access your data.
3. R Projects are Easily Bundled
Different types of data files allow for different levels of embedded descriptive information, such as Stata's variable labels. As far as I know, R has the most extensive options available for bundling information into a single file. Not only can R save data and descriptive labels in a single bundle, but functions specific to the data may be included in the same bundle as well. For example, if you are working with health data you may be interested in having not only the BMI and other health indexes for each individual but also the functions that calculate those indexes. Including these functions within an Rdata file is simple.
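A minimal sketch of such a bundle, with a made-up data frame and index function:
bmi <- function(weight_kg, height_m) weight_kg / height_m^2   # example health index function
health <- data.frame(id = 1:3,
                     weight_kg = c(70, 85, 60),
                     height_m = c(1.75, 1.80, 1.65))
health$bmi <- bmi(health$weight_kg, health$height_m)
save(health, bmi, file = "health_bundle.RData")   # data and function travel together
# load("health_bundle.RData") later restores both, so recipients can recompute the index themselves.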
It may not be clear to some users why this is a large advantage, since nearly all statistical software packages, as far as I know, can produce external script files, and R has this option as well. The advantage of R is that, in addition to external scripts, it allows complex or unique functions to be embedded directly within Rdata files.
So R, Now What?
If you accept that R is an ideal candidate for a standard for sharing statistical data, given its superior data compression, its open-source code, and its ability to easily bundle information into a single file, there is still a bit of a problem posed by the R workspace system.
As far as I know, there are no standards for transferring data between R users. Thus even though transfers are highly efficient, it is not clear how to organize your data within an R workspace. This is in contrast to Stata data, which has a standard spreadsheet structure with added information in the form of variable labels and factor variables.
The easiest solution to this problem would be to include some kind of standard documentation, such as a readme function, in any Rdata file released. This function would display a list of the objects in the Rdata file and describe their components. Further refinements to such a standard might include establishing common names for simple data sets, such as naming the default data “mydata”.
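A minimal sketch of what such a convention might look like (the object names and descriptions are purely illustrative, not an established standard):
mydata <- data.frame(id = 1:5, temp = c(12.1, 13.4, 11.8, 14.0, 12.9))
readme <- function() {
  cat("Contents of this RData file:\n")
  cat("  mydata : daily temperature readings (id, temp in degrees C)\n")
  cat("  readme : this function\n")
}
save(mydata, readme, file = "mydata.RData")
# A recipient runs load("mydata.RData"); readme() to see what the file contains.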
Comments
I am just transferring some files from MATLAB format to R, so I can give one more data point for your data-compression dataset. The example is a dataset of numeric values, size 589904 x 7.
CSV: 32.07 MB
Matlab: 14.63 MB, read time 260 ms
RData: 7.58 MB, read time 320 ms
The idea of including a "readme" object in RData files is really useful! One can even include an info() function that gives the main information (variable labels, etc.) about the dataset(s) just by typing info().
That'd be reinventing the comment() function. All of this is usually dealt with by a package architecture. I use GitHub repos with README files to get similar results.
Oh, I didn't know that comment() existed! I have to start using that.
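For reference, comment() is part of base R; it attaches a character attribute to an object, which save() carries along with the data. A tiny illustration (the object and text are made up):
mydata <- data.frame(x = 1:3)
comment(mydata) <- "Toy dataset; x is an integer id"
comment(mydata)   # returns the description, e.g. after load() in a later session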
I am concerned that by transferring data with the .RData extension, the data may become unusable in other programs until it has been opened in R and written out with a new extension. Compressing data is a good idea, and it is exactly what R does when you call the base::save function. For example, a text file compressed with gzip can be read into SAS without first unzipping the file in R and saving it with a more universally accepted extension.
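For what it's worth, such a gzip-compressed text file can be written directly from R; a small sketch (the object and file names are made up):
mydata <- data.frame(id = 1:3, value = c(2.4, 3.1, 1.8))
con <- gzfile("mydata.csv.gz", "w")
write.csv(mydata, con, row.names = FALSE)   # a gzipped csv that other tools can also read
close(con)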
A reasonable comment, of course. I would suggest providing data in two formats: csv and Rdata. However, if Rdata files become the norm, then the responsibility for compatibility between systems will shift to proprietary software providers, who will need to support Rdata files. Since the source code for these files is open, they will have few excuses for not complying with the standards for data transfer.
File formats are often dictated by legacy code. French official stats often come in old formats because the backend is coded in SAS or something like that. If you want to change the data standard, you have to provide these legacy routines in the R language and hope that they get picked up as quickly as possible.
DeleteNow now now, everyone knows the official standard data format is Excel.xlsx :-( . I'll just point out you've left out ENVI, gzip, tar, .idt, TIFF, JPG, and about a zillion other file formats. The difficulty in getting .Rdata to be accepted is not only unpacking the objects (which could include closures as well as data arrays and structures) but writing intepreters for the objects. That's a big job -- tho' I was happy to see that Mathematica released an intepreter last year.
Thanks for this. I agree wholeheartedly. My industry tends to use CSV and SAS7BDAT (SAS) data files. I tend to find that the SAS7BDAT files are actually larger than the csv files! I really like the readme/info idea with the RData file and will start doing that more often.
One potential problem with your proposal: I generated a 61 gigabyte database last week in the course of running some Monte Carlo simulations. I'm pretty happy that I chose to save it as an SQLite file instead of an RData file (although SQLite doesn't support concurrent writes, which is a pain). Do you know of any ways to incrementally load from or save to RData files?
ReplyDeleteREADME <- scan("readme.txt",what="character",sep="n")
ReplyDeletenow you have a README within the .RData
Why not use HDF?
Here is a test for one data.table using the "rhdf5" package:
.Rdata: 10 MB, 2.8 sec to write, 0.6 sec to read
.h5: 21 MB, 25 sec to write, 2.8 sec to read
At least in this case, .RData wins hands down.
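For anyone wanting to reproduce this kind of comparison, a rough sketch using the rhdf5 package from Bioconductor (data and file names are made up, and results will vary by system):
library(rhdf5)
dt <- data.frame(matrix(rnorm(2e6), ncol = 10))
system.time(save(dt, file = "dt.RData"))        # write .RData
system.time(load("dt.RData"))                   # read .RData
h5createFile("dt.h5")
system.time(h5write(dt, "dt.h5", "dt"))         # write .h5
system.time(dt_h5 <- h5read("dt.h5", "dt"))     # read .h5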
Exactly. Why use Rdata, which nothing other than R really reads, instead of HDF5, which everything under the sun can read?
ReplyDeleteHas anyone compared to SAS xpt files?