Thursday, March 27, 2014

Why use R? Five reasons.

In this post I will go through five reasons to use R: zero cost, crazy popularity, awesome power, dazzling flexibility, and mind-blowing support. I believe R is the best statistical programming language to learn. As a blogger who has contributed over 150 posts in Stata and over 100 in R, I have extensive experience with both a proprietary statistical programming language and the open source alternative. In my graduate career I have also had the opportunity to experiment with the proprietary packages SPSS, SAS, Mathematica, and Mplus.

Acknowledgements
I would like to thank the bloggers Robert A. Muenchen and David Smith, whose informative blog posts were useful in researching this post. I would also like to thank my fiancée Jennifer Cairns for suggesting the idea of this blog post. Perhaps some day you will give up your Stata-loving ways.

R is Free

Free for you
Not much needs to be said about this point. R is free to install, use, update, clone, modify, redistribute, even sell. You can find installers for Linux, Mac OS X, and Windows at r-project.org (CRAN). Choosing to code in R can help you avoid a significant blow to your checkbook.

Free for everyone
Even if you have the budget to afford any software package you desire, it is unlikely that everybody you end up working with will have such a generous budget. Unequal access to software can pose significant barriers to collaborative work. Not so long ago I was a research assistant on a project in which most students ended up having to purchase licenses in order to work on the project on their personal computers. I believe each of the students ended up spending around $400.

Table 1: Pricing of Perpetual Usage Licenses (with 1 year support if available)

Columns: Stata (SE), SPSS, SAS, STATISTICA QC. Rows: Student, Business. One cell reads "Quote Requested"; the remaining price cells are missing.


So sure, it is generally inexpensive to get a student copy of some software, but be prepared either to pay a lot in the future or to have your company pay even more. In addition, don’t be misled into thinking these are one-time costs. Most statistical packages only provide “support” for the two most recent major releases. Thus, if you want your copy to be supported, to be able to install new commands, or even to open data saved by later versions, you must keep purchasing new versions of the software as they are released.



Looking at this graph, it is clear that, except for SAS, major version release rates of proprietary software are very high, meaning that a perpetual user license only gives you access to the most recent version of the software for a year or two on average.

Major Version Release History Table

Year  Stata  SPSS  SAS  STATISTICA  R
1999  6      9     8    5           1
2000  7            8    5           1
2001  7            8    6           1
2002  7            9    6           1
2003  8            9    6           1
2004  8            9    7           2
2005  9            9    7           2
2006  9            9    7           2
2007               9    8           2
2008               9    8           2
2009               9    8           2
2010               9    9           2
2011               9                2
2012               9                2
2013               9                3
2014               9                3


From this table one may incorrectly assume that R is not as well maintained as proprietary software. However, this table only tracks major releases. A better measure is the number of minor (sub-version) releases and updates. Most statistical software puts out one or fewer sub-versions per major release (Stata 1.0, SPSS 0.8, and Statistica 0.25 sub-versions per major release since 1999), compared with R, which has averaged 16.3 sub-versions per major release starting in 2000. This means that since inception R has already gone through 52 versions. This is a much higher rate of updating than that of any proprietary option.

R is Popular

Though it is a little challenging to figure out exactly what rubric to use when measuring popularity, R seems to be growing rapidly in popularity among general users as well as employers. Robert A. Muenchen has an excellent blog where he examines the ongoing popularity of data analysis software.
Job postings that mention R are increasing rapidly compared with a host of alternative software, though searching for R is complicated by its ambiguous name. (Graph borrowed from Robert A. Muenchen’s blog.)

In addition to the graphs presented on Muenchen’s blog, I also include Google Trends search results, which he does not. He leaves them out because he is concerned that there is too much noise when searching for “R”. However, Google is testing a Trends feature that allows it to pick up on “R Programming Language”, which I believe should be much more accurate than previous searches, which might confuse “R” with Toys “R” Us, etc.

Popularity of Google Searches for Statistical Software

But why do we care how popular R is? Programming languages (which every statistical package worth its salt has) are highly dependent upon their user base in order to develop. How fast they develop, how powerful they are, and how long they can be expected to be supported is entirely based on how widely they are used.

R is Becoming the Standard 
In a world in which time is limited and the time involved in learning a statistical package is nontrivial, learning to program in a system that is unpopular or unsustainable can be futile and frustrating. I will not make any predictions as to the life expectancy of any proprietary software options out there except to say that there are a lot of expensive options in a market in which the most competitive option (R) is free. I do not know how long proprietary options will be around, but some version of R is likely to remain popular for the indefinite future. R is well maintained by an active and highly talented community.

Thus, since R is the emerging standard for statistical programming, learning to use it is likely to be highly rewarding (both financially and in terms of opportunities).

R is Popular with Employers
In two recent studies, including one of over 17,000 technology professionals, R was the highest paid technical skill, with an average salary of $115,531 (read more on this here).


R is Powerful


R can handle complex and large data
Here are some useful articles related to large data and R:
3. Tips on Computing with Big Data in R

R can easily program complex simulations
This is my area of expertise. On my blog EconometricsBySimulation.com I have over 100 releases of code in R and 170 releases in Stata. The majority of these posts feature some kind of simulation. In general I find Stata to be faster to work with when doing simple simulations. However, if there is any kind of complexity involved in the simulation, working with Stata is a nightmare compared with R. I have also worked with simulations in SAS (for my master’s thesis). SAS was by far the most frustrating programming experience of my life.

R can be used on High Performance Computer Clusters
High performance computer clusters (HPCCs) are large computer clusters (often at universities) that coordinate the processing capacity of hundreds or thousands of processors simultaneously. These systems are able to crunch through simulations, data, or analysis at a rate much higher than is typically achievable on an individual machine. Thus problems that could take weeks to solve on your personal computer might take only hours on an HPCC.

The way these systems do this is by distributing individual tasks across many different processors. With proprietary software, this is not possible unless there is a special license designed just for this purpose (I am not aware of any, except possibly for SAS or SPSS through a third-party toolkit). Proprietary systems are inherently limited by the nature of their architecture and their price. So even if you have access to an HPCC, you could find it cost prohibitive to purchase the licenses needed to use it.
  
R supports multicore task distribution
In the modern world nearly all computers come with multiple cores, yet traditional programming techniques assume only one processor. Most statistical programming languages have responded to this new hardware by allowing individual commands to be threaded out to different cores. Some of these programs, such as Stata, actually make you buy a different license based on the number of cores you are using (perpetual without “maintenance”: 1 core $1,695, 2 cores $2,495, 4 cores $3,125, 6 cores $3,410, 8 cores $3,695, 12 cores $4,160, 16 cores $4,625, 24 cores $5,045, 32 cores $5,460, and 64 cores $6,445).


Expected speed gains from parallel processors (from Stata)

Let’s compare the pricing scheme to the expected increase in performance. We can see that going from 1 processor to 8 processors (a $2,000 cost increase) is expected to make most commands only around 3x faster. This is a significant increase in speed, though an even more significant increase in software cost.
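
The license arithmetic is easy to check with a few lines of code. The sketch below uses Python purely as a calculator. The prices are the Stata figures quoted above and the roughly 3x speedup at 8 cores is the figure cited above; the function names are made up for illustration.

```python
# Perpetual Stata license prices by core count, as quoted in this post (USD).
PRICES = {1: 1695, 2: 2495, 4: 3125, 6: 3410, 8: 3695,
          12: 4160, 16: 4625, 24: 5045, 32: 5460, 64: 6445}


def extra_cost(cores, prices=PRICES):
    """Price premium of a multicore license over the single-core license."""
    return prices[cores] - prices[1]


def cost_per_extra_speedup(cores, speedup, prices=PRICES):
    """Extra dollars paid for each 1x of speed gained beyond a single core.

    `speedup` is an assumed figure supplied by the caller (for example the
    rough 3x cited for 8 cores), not something this function measures.
    """
    return extra_cost(cores, prices) / (speedup - 1)


if __name__ == "__main__":
    for cores in sorted(PRICES):
        print(f"{cores:>2} cores: ${PRICES[cores]:,} (+${extra_cost(cores):,})")
    # Assumed speedup of about 3x when moving from 1 to 8 cores.
    print(f"8 cores at 3x: ${cost_per_extra_speedup(8, 3.0):,.0f} per extra 1x of speed")
```

Swapping in a different core count or a different assumed speedup shows how quickly the premium grows relative to the gain.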

This would not be so surprising and disturbing if a major portion of the advances in computer speed were not based on multicore processing. Individual cores have generally gained little in speed over the last 5 years, for instance, while multicore processing has become very popular. Therefore, the average user who pays only $1,695 for a single-core license of Stata will, more likely than not, be running into the same hardware limits he or she would have faced half a decade ago.

For similar gains in speed through automatically distributing computational tasks in R, a spinoff has been developed called pqR (pretty quick R).  In addition, there are a number of packages developed that allow for multicore distributed tasks.  I personally have not experimented with them.

My overall takeaway in terms of multicore processing is that it is going to be easier to see speed gains from additional cores in Stata (if you are willing to shell out the cash); however, it is possible to accomplish similar gains in R by using a non-mainstream flavor called pqR or by implementing one of the several packages.

R is Flexible

R can do everything from standard or complex statistical practices, to Bayesian modelling, to GIS map building, to building interactive web applications and interactive tests.

Here are some resources:
1. Statistical/Econometrics Models
  a. Basics
  b. Panel data
      ii.   Econometrics Academy – Software R
      iii.  PLM Vignette 
  c. Time series
  d. Spatial Econometrics

R is Well Supported

A wonderful feature of proprietary software is the software’s paid help line. When I had a supported version of Stata I could send an email to Stata Corp and get a response, usually within a few hours. This feature significantly increased my productivity and was ultimately something I felt I would find difficult to live without when I first contemplated switching to R. However, as I became more aware of the resources available for R, I realized that the support I could find for my R questions was much faster and more thorough than what was typically available through my paid service at Stata.

StackOverflow (SO)
SO is a mind-blowing resource! This question-and-answer website is the place to find answers to all of your coding questions. Most questions I come up with have already been asked by someone else, and I can usually find them answered on the site. New questions, though, are even better! The average response time for the questions I ask on SO seems to be less than 10 minutes. Plus, they have a great reputation system in which people who ask good questions and people who give good answers are rewarded. Every time I post a question on SO, I am astonished again by how fast it is answered. Looking back on the Stata help line, I realize that it would be hard to go back to such a sluggish system.

R-Bloggers
I have been a reader of R-Bloggers for a couple of years now and I am continually learning something new.  As a blog aggregator of over 450 contributing blogs, it is an excellent way to learn new or refresh old skills in R.  In addition, as a blogger, it is a great way to expand your audience.  R-Bloggers also provides a resource by storing blog posts from blogs that have been retired or are no longer active.  As another comparison with Stata, I started a Stata “blog aggregator” (http://stata-bloggers.com/) after getting involved and excited by R-Bloggers.  I could only find a handful of blogs on Stata.  In addition, humorously, the official Stata blog was uninterested in being aggregated on the aggregator, providing what seemed to me a signal that Stata Corp was not interested in supporting a general Stata blogging movement.


10 comments:

  1. Great post. Thanks Jennifer

    I'd particularly stress the innovation provided by the constant stream of new and updated packages, often well-supported by their authors

  2. Thank you for the post.

    The link to the Panel Data Introduction slides from Princeton seems to be broken. I also find the vignette of the package plm very helpful for introducing panel data analysis in R: http://cran.r-project.org/web/packages/plm/vignettes/plm.pdf

  3. Please fix the last name of Bob Muenchen (which you consistently misspell as "Meuchen").

    1. Thank you, sorry about that. Too many late nights.

  4. Thanks for pointing out the new Google Trends feature. It took me a while to figure out that if you enter "Stata" it automatically suggests "Stata software" but when you enter "R" it does not offer variations. However, if you type "R programming" it will then offer "R Programming Language". Very nice! I'll probably add this back to my popularity article that you reference.

    It's crazy that Statacorp did not want to syndicate its blog to stata-bloggers.com! Companies should be doing all they can to encourage their users to offer them free marketing.

    Do you know how to count the number of commands in Stata? I did so for SAS and found that R was adding more new commands in one year than SAS added in 40 (http://r4stats.com/2013/03/19/r-2012-growth-exceeds-sas-all-time-total/). I suspect the same is true for Stata. The commands are not directly comparable of course, but it is a measure that puts R's phenomenal growth in perspective.

    Cheers,
    Bob Muenchen

    1. Hi Bob, I wanted to mention that I looked more closely at the google trends data and am not convinced I believe the graph any more. The primary problem is that the top search result by far is \r which is a character likely referenced in many languages.

      As for the number of commands in Stata, I am pretty sure that number is also very low compared with R. It is just so much harder to write commands in Stata syntax. I have had extensive experience with it yet still find it difficult and counterintuitive. It is all because of the temperamental way Stata parses commands using spacing and ordering.

    2. As for the list of Stata packages, I am not sure if there is any comprehensive public list. Within Stata I ran "search package, all", which returned 499 results, and "search command, all", which returned 697 results. I copied the first line of each entry, which lists just the name or URL, and removed duplicates. My list has 1325 items; I have included it here with commas separating each value. It is worth noting that packages in Stata are not the same as packages in R. All packages that I am aware of in Stata only have a single command. In R, it is almost always the case that multiple functions/commands are bundled within the same package.

      Looking at the base folder in Stata, I searched for *.ado and found a little over 2,300 ado files, which are individual commands. However, some if not many of these ado files are shortcuts for other commands. Thus many commands are counted more than once; for example, reg, regr, rege, reges, and regress are all the same command.

      I also wanted to say that I don't think it was entirely foolish for Stata not to embrace the blogging community, and in particular my blog. I started blogging in Stata and initially I received several encouraging emails from the corporation, but I eventually ended up writing some posts which were unflattering to the organization, some of which they responded to.

      However, by the time it came to me starting the aggregator, I think they had already decided as a corporation to distance themselves from the blogging community, since mine was not the only blog out there which had written posts in favor of R over Stata.

      List of Stata commands
      https://gist.github.com/EconometricsBySimulation/9958775

  5. In the cost comparison for SPSS, that $5430 is only for the base version (commercial license). It climbs to $16k+ for the 'Premium' commercial version of SPSS. Ouch. A commercial SAS license is similarly expensive once you're beyond the basics. And a 'student' academic SAS license does expire. The equivalents to all that are, of course, entirely free in R.
