In this post I will go through five reasons I believe R is the best statistical programming language to learn: zero cost, crazy popularity, awesome power, dazzling flexibility, and mind-blowing support. As a blogger who has contributed over 150 posts in Stata and over 100 in R, I have extensive experience with both a proprietary statistical programming language and the open source alternative. In my graduate career I have also had the opportunity to experiment with the proprietary packages SPSS, SAS, Mathematica, and Mplus.
Acknowledgements
I would like to thank the bloggers Robert A. Muenchen and David Smith, whose informative blog posts were useful in researching this post. I would also like to thank my fiancée Jennifer Cairns for suggesting the idea of this blog post. Perhaps some day you will give up your Stata-loving ways.
R is Free
Free for you
Not much needs to be said about this point. R is free to install, use, update, clone, modify, redistribute, even sell. You can find installers for Linux, Mac OS X, and Windows at r-project.org (CRAN). Choosing to code in R can help you avoid a significant blow to your checkbook.
Free for everyone
Even if you have the budget to afford any software package you desire, it is unlikely that everybody you end up working with will have such a generous budget. Unequal access to software can pose significant barriers to collaborative work. Not so long ago I was a research assistant on a project in which most students ended up having to purchase licenses in order to work on the project on their personal computers. I believe each of the students ended up spending around $400.
Table 1: Pricing of Perpetual Usage Licenses (with 1 year support if available)
[Table columns: Stata (SE), SPSS, SAS, STATISTICA QC. Rows: Student, Business (one entry listed as "Quote Requested").]
So sure, it is generally inexpensive to get a student copy of some software, but be prepared either to pay heavily in the future or to have your company pay hugely.
In addition, don't be misled into thinking these are one-time costs. Most statistical packages only provide "support" for the two most recent major releases. Thus, if you want your product to be supported, able to install new commands, or even able to access data saved by later versions, you must continue to purchase new versions of the software as they are released.
Looking at the release history below, it is clear that, except for SAS, major version release rates of proprietary software are very high. This means that purchasing a perpetual user license only provides you with access to the most recent version of the software for, on average, a year or two.
Major Version Release History Table
[Table columns: Year, Stata, SPSS, SAS, STATISTICA, R. Rows: 1999 through 2014.]
From this table one may incorrectly assume that R is not as well maintained as proprietary software. However, this table only tracks major releases. A better measure is the number of subversion releases/updates. Most statistical software releases one or fewer subversions per major release (Stata 1.0 subversions, SPSS 0.8 subversions, Statistica 0.25 subversions since 1999), compared with R, which has released on average 16.3 subversions per major release starting in 2000. This means that since inception R has already gone through 52 versions. This is a much higher rate of updating than that of any proprietary software option.
R is Popular
Though it is a little challenging to figure out exactly what rubric to use when measuring popularity, R seems to be growing rapidly in popularity among general users as well as employers. Robert A. Muenchen has an excellent blog where he examines the ongoing popularity of data analysis software.
R job prospects are rapidly increasing relative to a host of alternative software. The search for R is complicated by its ambiguous name. (Graph borrowed from Robert A. Muenchen's blog.)
In addition to the graphs presented on Muenchen's blog, I also include Google Trends search results, which he does not. He leaves them out of his post because he is concerned that there is too much noise when searching for "R". However, Google is testing a Trends feature that allows it to pick up on "R Programming Language", which I believe should be much more accurate than previous searches that might confuse "R" with Toys "R" Us etc.
Popularity of Google Searches for Statistical Software
But why do we care how popular R is? Programming languages (which all statistical software worth its salt has) are highly dependent upon their user base in order to develop. How fast they develop, how powerful they are, and how long they can be expected to be supported all depend on how widely they are used.
R is Becoming the Standard
In a world in which time is limited and the time involved in learning a statistical package is nontrivial, learning to program in a system that is unpopular or unsustainable can be futile and frustrating. I will not make any predictions as to the life expectancy of any proprietary software options out there except to say that there are a lot of expensive options in a market in which the most competitive option (R) is free. I do not know how long proprietary options will be around, but some version of R is likely to remain popular for the indefinite future. R is well maintained by an active and highly talented community.
Thus, as R is the emerging standard for statistical programming, learning to use it is likely to be a highly rewarding process (both financially and in terms of opportunities).
R is Popular with Employers
In two recent studies, including one of over 17,000 technology professionals, R was the highest paid technical skill, with an average salary of $115,531 (read more on this here).
R is Powerful
R can handle complex and large data
Here are some useful articles related to large data and R:
This is my area of expertise. On my blog EconometricsBySimulation.com I have over 100 releases of code in R and 170 releases in Stata. The majority of these posts feature some kind of simulation. In general I find Stata to be faster to work with when doing simple simulations. However, if there is any kind of complexity involved in the simulation, working with Stata is a nightmare compared with R. I have also worked with simulations in SAS (for my master's thesis). SAS was by far the most frustrating programming experience of my life.
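To give a flavor of the kind of simulation I mean, here is a minimal sketch in R. It is not taken from any particular post. It repeatedly draws data from a known linear model, fits OLS, and checks that the slope estimates center on the true value. The function name simulate_slope and all parameter values are made up for illustration.

```r
# Monte Carlo sketch: sampling distribution of an OLS slope.
# simulate_slope() and its defaults are illustrative assumptions.
set.seed(101)

simulate_slope <- function(n = 100, beta = 2) {
  x <- rnorm(n)
  y <- 1 + beta * x + rnorm(n)   # true intercept 1, true slope beta
  unname(coef(lm(y ~ x))[2])     # return the estimated slope
}

# Repeat the experiment 500 times
slopes <- replicate(500, simulate_slope())

mean(slopes)  # should land close to the true slope of 2
sd(slopes)    # simulated standard error of the slope estimate
```

Adding complexity (a different error distribution, a misspecified model, a second estimator to compare) is mostly a matter of editing the body of the function.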
R can be used on High Performance Computer Clusters
High performance computer clusters (HPCCs) are large computer clusters (often in universities) which manage the processing capacity of hundreds or thousands of processors simultaneously. These systems are able to crunch through simulations, data, or analysis at a rate much higher than that typically achievable on an individual machine. Thus problems which could take weeks to solve on your personal computer might take only hours on an HPCC.
The way these systems do this is by distributing individual tasks across many different processors. With proprietary software this is generally not an option unless there is a special license designed just for this purpose (I am not aware of any, except possibly for SAS or SPSS through a third-party toolkit). Proprietary systems are inherently limited by the nature of their architecture and their price. So even if you have access to an HPCC, you could find it cost prohibitive to purchase the licenses needed to use it.
R supports multicore task distribution
In the modern world nearly all computers come with multiple cores, yet traditional programming techniques assume only one processor. Most statistical programming languages have responded to this new structure by allowing individual commands to be threaded out to different cores. Some of these programs, such as Stata, actually make you buy a different license based on the number of cores you are using (perpetual without "maintenance": 1 core $1,695, 2 cores $2,495, 4 cores $3,125, 6 cores $3,410, 8 cores $3,695, 12 cores $4,160, 16 cores $4,625, 24 cores $5,045, 32 cores $5,460, and 64 cores $6,445).
Expected speed gains from parallel processors (from Stata)
Let's compare the pricing scheme to the expected increase in performance. We can see that going from 1 processor to 8 processors (a $2,000 cost increase) only makes most commands run around 3x faster. This is a significant increase in speed, though it comes with a very significant increase in software cost.
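The arithmetic behind that comparison can be laid out in a few lines of R. This is only a sketch: the prices are the ones quoted above, and the 3x speedup at 8 cores is my rough reading of the Stata graph, not an exact figure.

```r
# Stata perpetual license prices quoted above, by number of cores
cores <- c(1, 2, 4, 6, 8, 12, 16, 24, 32, 64)
price <- c(1695, 2495, 3125, 3410, 3695, 4160, 4625, 5045, 5460, 6445)

# Extra cost of each license relative to the single-core license
extra_cost <- price - price[1]
names(extra_cost) <- cores
extra_cost["8"]   # 2000, the cost increase mentioned in the text

# Assumption: roughly a 3x speedup at 8 cores (approximate reading of the graph)
speedup_8 <- 3
price_ratio_8 <- price[cores == 8] / price[1]
price_ratio_8     # about 2.18 times the single-core price

# Extra dollars paid per unit of additional speed at 8 cores
extra_cost[["8"]] / (speedup_8 - 1)   # 1000
```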
This would not be so surprising and disturbing if a major portion of the advances in computer speed were not based on multicore processing. Individual cores have generally gained little in speed over the last 5 years, for instance, while multicore processing has become very popular. Therefore, for the average user who only pays $1,695 for a single-core license of Stata, more likely than not his or her version will be running into hardware limits similar to those faced half a decade ago.
For similar gains in speed through automatically distributing computational tasks in R, a spinoff has been developed called pqR (pretty quick R). In addition, there are a number of packages developed that allow for multicore distributed tasks. I personally have not experimented with them.
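For readers who want a starting point, here is a hedged sketch using the parallel package, which ships with R. It is an illustration rather than something I have benchmarked: the helper boot_mean, the two-worker cluster, and the 200 replications are all arbitrary choices for the example.

```r
# Sketch using the 'parallel' package that ships with R.
library(parallel)

# A small task to repeat many times: the mean of a bootstrap resample.
# boot_mean() is a hypothetical helper for this example.
boot_mean <- function(i, x) mean(sample(x, replace = TRUE))

set.seed(42)
x <- rnorm(1000)

# Start two worker processes (socket clusters work on Linux, Mac OS X, and Windows)
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 123)  # reproducible random numbers on the workers

# Distribute 200 bootstrap replications across the workers
results <- parLapply(cl, 1:200, boot_mean, x = x)
stopCluster(cl)

boot_means <- unlist(results)
length(boot_means)  # 200
sd(boot_means)      # bootstrap standard error of the mean
```

For a task this small the overhead of starting workers will likely outweigh any gain; the pattern pays off when each replication is expensive.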
My overall takeaway in terms of multicore processing is that it is going to be easier to see gains in speed from having more cores in Stata (if you are willing to shell out the cash). However, it is possible to accomplish similar gains in speed using a non-mainstream flavor of R called pqR or by using one of the several packages.
R is Flexible
R can handle everything from standard or complex statistical practices to Bayesian modelling, GIS map building, interactive web applications, and interactive tests.
Here are some resources:
1. Statistical/Econometrics Models
a. Basics
b. Panel data
c. Time series
d. Spatial Econometrics
A wonderful feature of proprietary software is the software's paid help line. When I had a supported version of Stata I could send an email to StataCorp and get a response, usually within a few hours. This feature significantly increased my productivity and ultimately was something I felt I would find difficult to live without when I first contemplated switching to R. However, as I became more aware of the resources available for R, I realized that the support I could find for my R questions was much faster and more thorough than what was typically available even through my paid service at Stata.
StackOverflow (SO)
SO is a mind-blowing resource! This question-and-answer website is the place to find answers for all of your coding questions. Most questions I come up with have already been asked by someone else, and I can usually find them answered on the site. New questions, though, are even better! The average response time for the questions I ask on SO seems to be less than 10 minutes. Plus, they have a great reputation system in which people who ask good questions and people who give good answers are rewarded. Every time I post a question on SO, I am astonished again by how fast it is answered. Looking back on the Stata help line, I realize that it would be hard to go back to such a sluggish system.
R-Bloggers
I have been a reader of R-Bloggers for a couple of years now and I am continually learning something new. As a blog aggregator of over 450 contributing blogs, it is an excellent way to learn new skills or refresh old ones in R. In addition, as a blogger, it is a great way to expand your audience. R-Bloggers also serves as an archive by storing posts from blogs that have been retired or are no longer active. As another comparison with Stata, I started a Stata "blog aggregator" (http://stata-bloggers.com/) after getting involved and excited by R-Bloggers. I could only find a handful of blogs on Stata. In addition, humorously, the official Stata blog was uninterested in being aggregated on the aggregator, which seemed to me a signal that StataCorp was not interested in supporting a general Stata blogging movement.
Maureen Tippel, Greg Kraft, MKT Management
Great post. Thanks Jennifer
I'd particularly stress the innovation provided by the constant stream of new and updated packages, often well-supported by their authors
:)
Thank you for the post.
The link to the Panel Data Introduction slides from Princeton seems to be broken. I also find the vignette of the package plm very helpful for introducing panel data analysis in R: http://cran.r-project.org/web/packages/plm/vignettes/plm.pdf
Thanks for this correction.
Please fix the last name of Bob Muenchen (which you consistently misspell as "Meuchen").
Thank you, sorry about that. Too many late nights.
Thanks for pointing out the new Google Trends feature. It took me a while to figure out that if you enter "Stata" it automatically suggests "Stata software" but when you enter "R" it does not offer variations. However, if you type "R programming" it will then offer "R Programming Language". Very nice! I'll probably add this back to my popularity article that you reference.
It's crazy that Statacorp did not want to syndicate its blog to stata-bloggers.com! Companies should be doing all they can to encourage their users to offer them free marketing.
Do you know how to count the number of commands in Stata? I did so for SAS and found that R was adding more new commands in one year than SAS added in 40 (http://r4stats.com/2013/03/19/r-2012-growth-exceeds-sas-all-time-total/). I suspect the same is true for Stata. The commands are not directly comparable of course, but it is a measure that puts R's phenomenal growth in perspective.
Cheers,
Bob Muenchen
Hi Bob, I wanted to mention that I looked more closely at the google trends data and am not convinced I believe the graph any more. The primary problem is that the top search result by far is \r which is a character likely referenced in many languages.
As for the number of commands in Stata, I am pretty sure that number is very low as well compared with R. It is just so much harder to write commands in Stata syntax. I have had extensive experience with it yet still find it difficult and counterintuitive. It is all because of the temperamental way Stata parses commands using spacing and ordering.
As for the list of Stata packages, I am not sure if there is any comprehensive public list. Within Stata I searched for "search package, all", which returned 499 results, and "search command, all", which returned 697 results. I copied the first line of each entry, which lists just the name or URL, and removed duplicates. My list has 1325 items; I have included it here with commas separating each value. It is worth noting that packages in Stata are not the same as packages in R. All packages that I am aware of in Stata only have a single command. In R, it is almost always the case that multiple functions/commands are bundled within the same package.
Looking at the base folder in Stata, I searched for *.ado and found a little over 2,300 ado files, which are individual commands. However, some if not many of these ado files are shortcuts for other commands. Thus many commands are counted more than once; for example reg, regr, rege, reges, and regress are all the same command.
I also wanted to say that I don't think it was entirely foolish for Stata to not embrace the blogging community and in particular my blog. I started blogging in Stata and initially I received several encouraging emails from the corporation. I eventually ended up writing some posts which were unflattering to the organization, some of which they responded to.
However, by the time it came to me starting the aggregator, I think they had already decided as a corporation to distance themselves from the blogging community, since mine was not the only blog out there which had written posts in favor of R over Stata.
List of Stata commands
https://gist.github.com/EconometricsBySimulation/9958775
In the cost comparison for SPSS, that $5430 is only for the base version (commercial license). It climbs to $16k+ for the 'Premium' commercial version of SPSS. Ouch. A commercial license SAS is similarly expensive, once you're beyond the basics. And a 'student' academic SAS license does expire. The equivalents to all that are, of course, entirely free in R.