In this post I will go through five reasons why I believe R is the best statistical programming language to learn: zero cost, crazy popularity, awesome power, dazzling flexibility, and mind-blowing support. As a blogger who has contributed over 150 posts in Stata and over 100 in R, I have extensive experience with both a proprietary statistical programming language and its open source alternative. In my graduate career I have also had the opportunity to experiment with the proprietary packages SPSS, SAS, Mathematica, and Mplus.
Acknowledgements
I would like to thank the bloggers Robert A. Muenchen and David Smith, whose informative blog posts were useful in researching this post. I would also like to thank my fiancée Jennifer Cairns for suggesting the idea of this blog post. Perhaps some day you will give up your Stata-loving ways.
R is Free
Free for you
Not much needs to be said about this point. R is free to install, use, update, clone, modify, redistribute, even sell. You can find installers for R on Linux, Mac OS X, and Windows at r-project.org (CRAN). Choosing to code in R can help you avoid a significant blow to your checkbook.
Free for everyone
Even if you have the budget to afford any software package you desire, it is unlikely that everybody you end up working with will have such a generous budget. Unequal access to software can pose significant barriers to collaborative work. Not so long ago I was a research assistant on a project in which most students ended up having to purchase licenses in order to work on the project on their personal computers. I believe each of the students ended up spending around $400.
Table 1: Pricing of Perpetual Usage Licenses (with 1 year support if available)
Columns: Stata (SE) | SPSS | SAS | STATISTICA QC
Rows: Student | Business
(Prices not preserved; the one surviving Business entry reads "Quote Requested".)
So sure, it is generally inexpensive to get a student copy of some software, but be prepared either to pay a large amount in the future or to have your company pay hugely.
In addition, don’t be mistaken into thinking these are one-time costs. Most statistical packages only provide “support” for the two most recent major releases. Thus, if you want your product to be supported, to be able to install new commands, or even to access data saved by later versions, you must continue to purchase new versions of the software as they are released.
Looking at the table below, it is clear that, except for SAS, the major version release rate of proprietary software is very high, meaning that purchasing a perpetual user license only provides you with access to the most recent version of the software for, on average, a year or two.
Major Version Release History Table
Columns: Year | Stata | SPSS | SAS | STATISTICA | R
Rows: 1999 through 2014, one row per year (release entries not preserved)
From this table one may incorrectly assume that R is not as well maintained as proprietary software. However, this table only tracks major releases. A better measure is the number of sub-version releases/updates. Most statistical software releases one or fewer sub-versions per major release (Stata 1.0 sub-versions, SPSS 0.8 sub-versions, Statistica 0.25 sub-versions since 1999), compared with R, which has released on average 16.3 sub-versions per major release starting in 2000. This means that since inception R has already gone through 52 versions. This is a much higher rate of updating than any of the proprietary options.
R is Popular
Though it is a little challenging to figure out exactly what rubric to use when measuring popularity, R seems to be growing rapidly in popularity among general users as well as employers. Robert A. Muenchen has an excellent blog where he examines the ongoing popularity of data analysis software.
R job prospects are rapidly increasing when R is compared against a host of alternative software, though the search for R is complicated by its ambiguous name. (Graph borrowed from Robert A. Muenchen’s blog.)
In addition to the graphs presented on Muenchen’s blog, I also include Google Trends search results, which he does not. In his post he leaves them out because he is concerned that there is too much noise when searching for “R”. However, Google is testing a Trends feature that allows it to pick up on “R Programming Language”, which I believe should be much more accurate than previous searches, which might confuse “R” with Toys “R” Us, etc.
Popularity of Google Searches for Statistical Software
But why do we care how popular R is? Programming languages (which all statistical software worth its salt has) are highly dependent upon their user base in order to develop. How fast they develop, how powerful they are, and how long they can be expected to be supported are entirely based on how widely they are used.
R is Becoming the Standard
In a world in which time is limited and the time involved in learning a statistical package is nontrivial, learning to program in a system that is unpopular or unsustainable can be futile and frustrating. I will not make any predictions as to the life expectancy of any proprietary software options out there except to say that there are a lot of expensive options in a market in which the most competitive option (R) is free. I do not know how long proprietary options will be around, but some version of R is likely to remain popular for the indefinite future. R is well maintained by an active and highly talented community.
Thus, with R as the emerging standard for statistical programming, learning to use it is likely to be highly rewarding (both financially and in terms of opportunities).
R is Popular with Employers
In two recent studies, including one of over 17,000 technology professionals, R was the highest paid technical skill, with an average salary of $115,531 (Read more on this here).
R is Powerful
R can handle complex and large data
Here are some useful articles related
to large data and R:
Simulation is my area of expertise. On my blog EconometricsBySimulation.com I have over 100 releases of code in R and 170 releases in Stata. The majority of these posts feature some kind of simulation. In general I find Stata to be faster to work with when doing simple simulations. However, if there is any kind of complexity involved in the simulation, working with Stata is a nightmare compared with R. I have also worked with simulations in SAS (for my master’s thesis). SAS was by far the most frustrating programming experience of my life.
R can be used on High Performance Computer Clusters
High performance computer clusters (HPCCs) are large computer clusters (often in universities) which manage the processing capacity of hundreds or thousands of processors simultaneously. These systems are able to crunch through simulations, data, or analysis at a rate much higher than that typically achievable on most individual systems. Thus problems which could take weeks to solve on your personal computer might take an HPCC user only hours.
The way these systems do this is by distributing individual tasks across many different processors. With proprietary software, this is generally not an option unless there is a special license designed just for this purpose (I am not aware of any, except possibly for SAS or SPSS by using a third-party toolkit). Proprietary systems are inherently limited by the nature of their architecture and their price. So even if you have access to an HPCC, you could find it cost prohibitive to purchase the licenses needed to use it.
R supports multicore task distribution
In the modern world nearly all computers come with multiple cores, yet traditional programming techniques assume only one processor. Most statistical programming languages have responded to this new structure by allowing individual commands to be threaded out to different cores. Some of these programs, such as Stata, actually make you buy a different license based on the number of cores you are using (perpetual license without “maintenance”: 1 core $1,695, 2 cores $2,495, 4 cores $3,125, 6 cores $3,410, 8 cores $3,695, 12 cores $4,160, 16 cores $4,625, 24 cores $5,045, 32 cores $5,460, and 64 cores $6,445).
Expected speed gains from parallel processors (from Stata)
Let’s compare the pricing scheme to the expected increase in performance. We can see that going from 1 processor to 8 processors (a $2,000 cost increase) only makes most commands run an expected 3x faster or so. This is a significant increase in speed, though an even more significant increase in software cost.
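To make the pricing comparison concrete, here is a small sketch in R that tabulates the premium over the single-core license for each tier listed above. The prices are the ones quoted in this post; the data frame name `tiers` and its column names are my own labels, chosen only for illustration.

```r
# Stata perpetual license prices by core count, as quoted in this post.
tiers <- data.frame(
  cores = c(1, 2, 4, 6, 8, 12, 16, 24, 32, 64),
  price = c(1695, 2495, 3125, 3410, 3695, 4160, 4625, 5045, 5460, 6445)
)

# Premium paid over the single-core license.
tiers$premium <- tiers$price - tiers$price[1]

# Premium per additional core (left as NA for the single-core tier).
tiers$premium_per_extra_core <- ifelse(
  tiers$cores > 1,
  tiers$premium / (tiers$cores - 1),
  NA
)

print(tiers)
```

The 8-core row reproduces the $2,000 premium discussed above; the remaining rows show how the premium grows across the other tiers.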
This would not be so surprising and disturbing if a major portion of the advances in computer speed were not based on multicore processing. Individual cores have generally gained little in speed over the last 5 years, for instance, while multicore processing has become very popular. Therefore, the average user who only pays $1,695 for a single-core Stata license will, more likely than not, be running into hardware limits similar to those faced half a decade ago.
For similar gains in speed through automatically distributing computational tasks in R, a spinoff has been developed called pqR (pretty quick R). In addition, a number of packages have been developed that allow for multicore distributed tasks. I personally have not experimented with them.
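As an illustration of what such packages look like in practice, here is a minimal sketch using the `parallel` package that ships with R. The task itself, the function name `one_replication`, the worker count of 2, and the seed are all assumptions chosen for illustration; this is not a benchmark and not a claim about how much faster your own code would run.

```r
library(parallel)  # distributed with base R

# Hypothetical task: one Monte Carlo replication returning a sample mean.
one_replication <- function(i, n = 1000) {
  mean(rnorm(n))
}

# Start two worker processes; adjust to your machine (see detectCores()).
cl <- makeCluster(2)

# Give the workers reproducible, independent random number streams.
clusterSetRNGStream(cl, iseed = 123)

# Distribute 100 replications across the workers.
results <- parLapply(cl, 1:100, one_replication)

# Always shut the workers down when finished.
stopCluster(cl)

estimates <- unlist(results)
summary(estimates)
```

For a task this small, the overhead of starting workers will likely outweigh any gain; the pattern only pays off when each replication is expensive.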
My overall takeaway in terms of multicore processing is that it is going to be easier to see speed gains from additional cores in Stata (if you are willing to shell out the cash); however, it is possible to accomplish similar gains in speed using a non-mainstream flavor of R called pqR or by implementing one of the several packages.
R is Flexible
R handles a wide range of tasks: from standard or complex statistical practices, to Bayesian modelling, to GIS map building, to building interactive web applications, to building interactive tests.
Here are some resources:
1. Statistical/Econometrics Models
a. Basics
b. Panel data
c. Time series
d. Spatial Econometrics
R is Well Supported
A wonderful feature of proprietary software is the paid help line. When I had a supported version of Stata I could send an email to StataCorp and get a response, usually within a few hours. This feature significantly increased my productivity and was ultimately something I felt I would find difficult to live without when I first contemplated switching to R. However, as I became more aware of the resources available for R, I realized that the support I could find for my R questions was much faster and more thorough than what was typically available through my paid service at Stata.
StackOverflow (SO)
SO is a mind-blowing resource! This question-and-answer website is the place to find answers to all of your coding questions. Most questions I come up with have already been asked by someone else, and I can usually find them answered on the site. New questions, though, are even better! The average response time for the questions I ask on SO seems to be less than 10 minutes. Plus, the site has a great reputation system in which people who ask good questions and people who give good answers are rewarded. Every time I post a question on SO, I am astonished again by how fast it is answered. Looking back on the Stata help line, I realize that it would be hard to go back to such a sluggish system.
R-Bloggers
I have been a reader of R-Bloggers for a couple of years now and I am continually learning something new. As a blog aggregator of over 450 contributing blogs, it is an excellent way to learn new skills or refresh old ones in R. In addition, as a blogger, it is a great way to expand your audience. R-Bloggers also provides a resource by storing blog posts from blogs that have been retired or are no longer active. As another comparison with Stata, I started a Stata “blog aggregator” (http://stata-bloggers.com/) after getting involved in and excited by R-Bloggers. I could only find a handful of blogs on Stata. In addition, humorously, the official Stata blog was uninterested in being aggregated, which seemed to me a signal that StataCorp was not interested in supporting a general Stata blogging movement.