Wednesday, April 16, 2014

The awesome folks at Quandl (an amazing data collection and distribution service) have been so kind as to allow me to write for their blog.

In my first post for them I demonstrate (with detailed R code) how a user of their free data services could easily compare the relationship between time series data in different markets.  In my particular post I compare the bitcoin market (sometimes mistakenly referred to interchangeably as Bitstamp, which is just one exchange) with the gold market.

Find the post, "Investigating the relationship between gold and bitcoin prices using R" here.

Monday, April 14, 2014

Public Universities Should Use Open Source Software

The Homeless Econometrician: Black Box Software

Choosing to go open source is a big deal.  It means that when you ask for help or request improvements, you are dealing with a highly active and generous user community that enjoys helping, rather than a corporation that needs to justify the expense of customer service by extracting fees either directly or indirectly (through workshops, "support costs", etc.).

1. Open source platforms attract a different kind of user.  Frequently university affiliated, open-source users tend to be much more generous with access to their work, so it may be possible to piggyback on work developed by other universities or agencies that already ask many of the same questions you would like to ask.  This is almost certainly not the case with proprietary software primarily deployed by corporations, which in contrast guard access to their resources and copyright their content.

2. Open source software can be of higher quality than proprietary software. Unfortunately, the open source movement has gotten a reputation for producing inferior products at no cost.  This is not a fair assessment.  Consider Firefox: up until the development of Google Chrome, Firefox was far and away the best browser available (and after spending time using Chrome, I still think it is).

3. Publicly funded research should be used to support public goods. There is an additional, philosophical reason for using open source products: simply by using them you promote the entire purpose of publicly funded research. By using any product, directly or indirectly, you help justify its continued existence and make it better.  If you use proprietary software, you are using public funds to support a private agency.  In contrast, using and supporting open source software uses public funds to support a public good.

4. Open source software is often better maintained than proprietary software. Open source platforms do not rely on the sluggish update cycle of a corporation’s software development department, and popular open source projects have rapidly outpaced their corporate rivals.  R, an open source statistical analysis language, for instance, has its base package updated 8 to 16 times more frequently than its proprietary rivals.

5. Popular open source software is often far more flexible than proprietary software.  Open source software frequently exceeds the bounds of its designers' imaginations, while proprietary software rarely does.  For example, both R and WordPress have an order of magnitude more options available than many of their rivals (roughly 60 thousand commands for R compared with 1 to 2 thousand for rivals, and 30 thousand plug-ins for WordPress compared with approximately 1,000 for Blogger).
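To get a rough sense of this scale from within R itself, you can count the objects each package exports.  The snippet below uses only packages that ship with R; the exact counts will vary by R version, and CRAN packages layer tens of thousands of additional functions on top:

```r
# Count the objects exported by a few of R's standard packages.
counts <- sapply(c("stats", "graphics", "utils"),
                 function(p) length(ls(paste0("package:", p))))
counts          # named vector: one count per package
sum(counts)     # total across just these three packages
```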

6. Open source software is transparent. There is no black box when it comes to open source software: by definition, all of the code is available for users to inspect.  This removes the ambiguity of never really being able to verify what exactly your software is doing.

7. Open source software is free.  This is one of the most commonly cited reasons to use open source software.  However, many research projects have large budgets and can afford proprietary software.  That may be true for you, but can your collaborators afford the same software?  What about their collaborators, or your students?  And what happens to your project if your funding falls short?

Thanks for reading this far.  Some of the open source software that I use includes:
For many computation tasks: R
For high-powered computation: Julia
For in-field data collection: Open Data Kit
For online assessments: The Concerto Platform

Tuesday, April 8, 2014

Getting Social Sciences Out of the Black Box: The Open Access Revolution

Trading Ethos for Logos
Up until very recently (the last 10 years) it has been uncommon for social science researchers to share their data even when the sharing would neither compromise the private information of the subjects nor the validity of the study. [1]

It is even more uncommon for researchers to willingly share the code they used to transform their data from its raw state to the state required to produce their analysis.

Researchers instead refer inquiries to what is often an incomplete "write up" included in their published work.

As a matter of course, they by and large do not publicly acknowledge the possibility of coding inconsistencies, unintentionally or intentionally introduced, except when an error is so gross that a reviewer could detect it through superficial examination.

Of course, as every programmer knows, errors (or bugs) inevitably pop up and are only removed through a combination of careful examination of code, experience at error detection, and luck.  For researchers not to share both their code and their data is therefore an act of unscientific hubris at best, and of deceit and publication chasing at worst.

Even when code is shared, it is often mangled and unintelligible, without comments or instructions on how to connect the data to the code.  I have experienced this situation first hand; unsurprisingly, it is usually more efficient to recode the entire analysis from scratch than to attempt to interpret the baffling and intractable code of another.

Furthermore, social science research is generally ranked by the number of citations it inspires, a measure more related to its ability to make a splash than to its truthfulness.  Thus it might be profitable for a researcher to exaggerate or selectively pick results that further the researcher's position at the expense of identifying true effects.

In this paradigm, even when it becomes widely known that a paper employed faulty methods, the critique of that research must clear the hurdle of being presented or published in a secondary source at some later point in time, resulting in a large-scale failure to disseminate corrections to analysis.

It is ironic that in a profession (economics) that takes as its base assumption that humans are utility maximizing and only moral when it is cost effective, such a lax incentive structure would have developed.

The result of this sordid academic publishing incentive system is a lemons market for academic research (see the recent Economist article) in which it is believed that there is good research "out there", but it is uncertain how that research is being done and by whom.

So what is the solution?
Enter the Open Access Revolution. This revolution is characterized by a new structure of analytical work, heralded by the rise of open access journals, open data providers, data-sharing requirements at mainstream journals, and the general proliferation of shared research methods through online communities.  Open access journals in particular are exceedingly helpful in improving the ethics of scientific research by providing standards under which research, data, analysis, and critiques must be shared.

The hope is that by transitioning from black box research methods to a system in which information is shared, the lemons market will be squeezed to death as unethical and sloppy research is crushed out of the market.  This is not to say that all "bad" research is the result of unethical behavior or careless coding or analysis; it is simply impossible to know the quality of research until it can be assessed externally.

Open Source Statistical Programming Languages (R, Julia, etc.)
In order to accomplish the full dissemination of research and research methods, not only should data and code be shared, but also the means of running the analysis.  To this end I believe open source programming languages are a highly effective tool.  Analysis done in these languages and shared through publications is highly accessible to any researcher, even one without a background in the language, since anybody can install these languages for free, run the code, and get help from their highly active communities.
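As a minimal sketch of what such shareable, rerunnable analysis looks like, consider a script that uses only a dataset shipped with R, so anyone who installs R for free can reproduce every number:

```r
# A fully reproducible analysis: anyone with R installed can run this
# end-to-end and obtain exactly the same results.
set.seed(101)                                       # pin down any randomness
model <- lm(mpg ~ wt + factor(am), data = mtcars)   # mtcars ships with R
round(summary(model)$coefficients, 3)               # results regenerated on demand
```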

This however may not be the case with other statistical programming languages which present large barriers to analysis by requiring the purchase of expensive or difficult to find software.  In addition, other software options may not be as well supported as these programming languages, resulting in a situation in which even if you have someone else's code, there might not be a support network available to help debug and service that code.

It is therefore my belief that open source statistical programming languages should become the standard for statistical analysis and for setting scientific research standards, especially in the highly subjective fields known as the "social sciences".

Data Aggregators: (Quandl, the Social Science Research Network, etc.)
In order to further the goal of open access to data, data aggregators, and in particular Quandl, have emerged.  These independent organizations have taken on the mammoth responsibility of gathering, organizing, and redistributing publicly available data for the purpose of advancing scientific research.

Combining code from an open source statistical programming language with data from Quandl, a new data analysis paradigm has emerged which is as different from the old paradigm as mountain spring water is from muddy puddles.  In this new paradigm, every piece of an analysis, including the access and downloading of data, can be coded and openly available for review and duplication.  Students and fellow researchers no longer have to trudge through the difficult and often impossible work of trying to rebuild the analysis of others from their published works.
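In practice, that can be as short as the sketch below.  It assumes the Quandl R package and an internet connection, and the dataset code shown is purely illustrative:

```r
# Every step of the analysis, including data acquisition, lives in the script.
# install.packages("Quandl")                # one-time setup
library(Quandl)
gold <- Quandl("BUNDESBANK/BBK01_WT5511")   # illustrative dataset code
head(gold)                                  # inspect the series
plot(gold, type = "l")                      # quick visual check
```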

For the sake of full disclosure, there is the possibility that I will be a guest blogger on the Quandl blog in the near future.  Thus my favoring of Quandl in this post, might be influenced by that anticipated opportunity.  That said, I find Quandl a truly remarkable undertaking and hopefully a game changer for the standards in future research.

[1] Purely personal observation.  I do not have data to support this claim.

Thursday, April 3, 2014

Does R have too many packages?
The Homeless Econometrician
The amazing growth and success of CRAN (the Comprehensive R Archive Network) is marked by the thousands of packages that have been developed and released by a highly active user base.  Yet even so, Kurt Hornik, one of the founders and primary maintainers of CRAN, asks in the Austrian Journal of Statistics (2012), "Are There Too Many Packages?"

As I understand it, some of the primary concerns regarding the immense proliferation of packages are: the lack of long term maintenance of many packages, the superabundance of packages, the inconsistent quality of individual packages, the lack of hierarchical dependency of packages, and insufficient meta package analysis.

1. Lack of long term maintenance of packages.  This is a challenge I have faced: R packages that I believe will solve my problem frequently are not maintained at the same rate as the R base system.

And how could they be?  The base system is updated several times a year, while there are thousands of packages; updating each of them for minor changes in the base system seems foolish and excessive.  However, as R currently stands, failing to update these packages results in packages which previously worked no longer functioning.  This is a problem I have experienced, and it is frankly very annoying.

One solution might be to limit the number of packages to those which have a sufficient developer base to ensure long term maintenance.  However, this would likely stifle the creativity and productivity of the wide R developer base.

Another solution is to limit the number of base system updates in order to limit the likelihood that a package will become outdated and need updating.

A third option, which I believe is the most attractive, is to allow code to specify which version of R it is stable on, and for R to act, for the commands in that package, as though it were running on that previous version of R.  This idea is inspired by how Stata handles user-written commands: each command simply specifies the version number it was written under, and no matter what later version of Stata is used, the command should still work.

I understand that such an implementation would require additional work from the R core team for each subsequent update.  However, such an investment may be worth it in the long run, as it would decrease the package maintenance burden created by R base updates.
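A hypothetical sketch of what such a declaration might look like in R (no such mechanism exists today; the function and names below are invented purely for illustration):

```r
# Hypothetical: the package author pins the R version the code was
# written against, and R emulates that version's behaviour for it.
# with_r_version("3.0.2", {
#   fit <- legacy_estimator(my_data)   # runs under 3.0.2 semantics
# })
```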

2. The superabundance of R packages.  The concern is that there are so many packages that users might find it difficult to wade through them in order to find the right one.  I don't really see this as a problem.  If someone wanted to learn to use all R packages, that task would of course be nearly impossible.  However, like most people, I believe, I learn to use new functions within packages to solve specific problems.  I don't really care how many packages are out there.  All I care about is that when I ask a question on Google or StackOverflow about how to do x or y, someone can point me to the package and command combination needed to accomplish the task.

3. The inconsistent quality of individual packages.  It is not always clear whether user-written packages really do what they claim.  I personally am constantly on the lookout for checks to make sure my code is doing what I think it is doing, yet I still consistently find myself making small errors which only show up through painstaking experimentation and debugging.

CRAN has some automated procedures in which packages are tested to ensure that all of their functions run without errors under normal circumstances.  However, as far as I know, there are no automated tests to ensure the commands are not silently failing by doing the wrong thing.  These kinds of error controls are left entirely to authors and users.  This concern comes to mind because a friend of mine recently ran two different Bayesian estimation packages which were supposed to produce identical results, yet each returned distinctly different results, one set with significant estimates and the other without.  Had he not thought to try two different packages, he would never have considered the potential for errors in the packages themselves.
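One practical safeguard is exactly the kind of cross-check my friend stumbled into: compute the same quantity two independent ways and compare.  A small sketch using only base R:

```r
# Cross-check lm() against ordinary least squares computed by hand.
X <- cbind(1, mtcars$wt)                       # design matrix with intercept
y <- mtcars$mpg
beta_manual <- solve(t(X) %*% X, t(X) %*% y)   # solve the normal equations
beta_lm     <- coef(lm(mpg ~ wt, data = mtcars))
all.equal(as.numeric(beta_manual), as.numeric(beta_lm))  # should be TRUE
```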

A solution to inconsistent package quality controls may be to have a multitiered package release structure in which packages are first released in "beta form" but require an independent reviewing group to check functionality and write up reports before attaining "full" release status.  Such an independent package review structure may be accomplished by developing an open access R-journal specifically geared towards the review, release, and revision of R packages.

4. The lack of hierarchical dependencies.  This is a major point in Kurt Hornik's paper.  He looks at package dependencies and finds that the majority of packages have no dependencies on other packages.  This indicates that while there are many packages out there, most are not building on the work of other packages.  This produces the unfortunate situation in which many package developers seem to be recreating the work of others.  I am not really sure whether anything can be done about this, or whether it really is an issue.

It does not bother me that many users write similar or duplicate code, because I think doing so helps the user better understand the R system, the problem, and the solution.  There is, however, the issue that the more times a problem is coded, the more likely someone is to code an error.  This brings us back to point 3: errors must be rigorously pursued and ruthlessly exterminated through an independent error detection system.

5. Insufficient meta package analysis.  Another point Kurt Hornik raises is that there are a lot of R packages out there but not a lot of information about how those packages are being used.  To address this, it might be useful to build into future releases of R the option to report usage statistics on which packages and functions are being used in combination with which other packages.  Package developers might find such information useful when deciding which functions to update.

Overall, it is impossible not to recognize CRAN as a huge success.  CRAN has been extremely effective at providing a database for the distribution of many R packages dealing with a myriad of user demands.  In a way, this post and the article that inspired it only present the problems associated with success.  Yet, given the great success of CRAN, how should we move forward?  This post presents some possible solutions.

Finally, I would like to say thank you to all of the fantastic R developers who have released so many packages.  I do not claim credit for any of the thoughts expressed in this post.  As a newcomer to R, I am not personally aware of the many thoughtful dialogues that must have already transpired regarding the issues raised in this post.  I am sure more thoughtful and considerate minds than mine have already given what they believe are the best solutions to the problems here raised.

Tuesday, April 1, 2014

Stata Fully Mapped into R

Hello all of you Stata loving statistical analysts out there!  I have great news.  I am finally nearly done with the package I have been working on which provides the mechanism for Stata users to seamlessly move from Stata to R through use of my new package "RStata"!

In this package I have taken 150 of the most commonly used commands in Stata and directly mapped their syntax into R.  Not only can they be called using identical syntax, but they also return identical output to the active window.  In order to accomplish this task, the package has built-in dependencies on many useful R packages such as plyr, ggplot2, glm, etc., so installation could take a while.

To see this new package in action, here is some sample code:


sysuse auto
regress mpg weight c.weight#c.weight foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  3,    70) =   52.25
       Model |  1689.15372     3   563.05124           Prob > F      =  0.0000
    Residual |   754.30574    70  10.7757963           R-squared     =  0.6913
-------------+------------------------------           Adj R-squared =  0.6781
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.2827

              mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
           weight |  -.0165729   .0039692    -4.18   0.000    -.0244892   -.0086567
c.weight#c.weight |   1.59e-06   6.25e-07     2.55   0.013     3.45e-07    2.84e-06
          foreign |    -2.2035   1.059246    -2.08   0.041      -4.3161   -.0909002
            _cons |   56.53884   6.197383     9.12   0.000     44.17855    68.89913

Please note, I only have a licensed version of Stata up to version 11 so newer commands are omitted from the package.

If you would like to beta test this package or contribute to mapping additional Stata commands, you can find it and installation instructions at:

Thursday, March 27, 2014

Why use R? Five reasons.

I believe R is the best statistical programming language to learn.  In this post I will go through five reasons: zero cost, crazy popularity, awesome power, dazzling flexibility, and mind-blowing support.  As a blogger who has contributed over 150 posts in Stata and over 100 in R, I have extensive experience with both a proprietary statistical programming language and the open source alternative.  In my graduate career I have also had the opportunity to experiment with the proprietary software SPSS, SAS, Mathematica, and MPlus.

I would like to thank the bloggers Robert A. Muenchen and David Smith, whose informative blog posts were useful in researching this post.  I would also like to thank my fiancée Jennifer Cairns for suggesting the idea of this blog post.  Perhaps some day you will give up your Stata loving ways.

R is Free Bob

Free for you
Not much needs to be said about this point.  R is free to install, use, update, clone, modify, redistribute, and even sell.  You can find installers for R for Linux, Mac OS X, and Windows at CRAN.  Choosing to code in R can help you avoid a significant blow to your checkbook.

Free for everyone
Even if you have the budget to afford any software package you desire, it is unlikely that everybody you end up working with will have such a generous budget.  Unequal access to software can pose significant barriers to collaborative work.  Not so long ago I was a research assistant on a project in which most students ended up having to purchase licenses in order to work on the project on their personal computers.  I believe each of the students ended up spending around $400.

Table 1: Pricing of Perpetual Usage Licenses (with 1 year support if available)

Stata (SE): Quote Requested

So sure, it is generally inexpensive to get a student copy of some software, but be prepared to either pay out large in the future or have your company pay out hugely.  In addition, don't mistake these for one-time costs.  Most statistical packages only provide "support" for the two most recent major releases.  Thus, if you want your product to be supported, able to install new commands, or even able to access data saved by later versions, you must continue to purchase new versions of the software as they are released.

Looking at this graph, it is clear that except for SAS, major version release rates of proprietary software are very high, meaning that purchasing a perpetual user license only provides access to the most recent version of the software for, on average, a year or two.

Major Version Release History Table


From this table one might incorrectly assume that R is not as well maintained as proprietary software.  However, the table only tracks major releases.  A better measure is the number of subversion releases/updates.  Most statistical software releases one subversion or fewer per major release (Stata 1.0 subversions, SPSS 0.8, Statistica 0.25 since 1999), compared with R, which has released on average 16.3 subversions per major release since 2000.  This means that since inception R has already gone through 52 versions, a much higher rate of updating than any proprietary option.

R is Popular

Though it is a little challenging to figure out exactly what rubric to use when measuring popularity, R seems to be growing rapidly in popularity among general users as well as employers.  Robert A. Muenchen has an excellent blog where he examines the ongoing popularity of data analysis software.
Comparing R against a host of alternative software, R job postings are rapidly increasing.  Searching for R is complicated by its ambiguous one-letter name.  (Graph borrowed from Robert A. Muenchen's blog.)

In addition to the graphs presented on Muenchen's blog, I also include Google Trends' search results, which he does not; in his post he is concerned that there is too much noise when searching for "R".  However, Google is testing a Trends feature that can pick up on "R Programming Language", which I believe should be much more accurate than previous searches that might confuse "R" with Toys "R" Us, etc.

Popularity of Google Searches for Statistical Software

But why do we care how popular R is?  Programming languages (which all statistical software worth its salt has) are highly dependent upon their user base in order to develop.  How fast they develop, how powerful they are, and how long they can expect to be supported are entirely based on how widely they are used.

R is Becoming the Standard 
In a world in which time is limited and the time involved in learning a statistical package is nontrivial, learning to program in a system that is unpopular or unsustainable can be futile and frustrating.  I will not make any predictions as to the life expectancy of the proprietary options out there, except to say that there are a lot of expensive options in a market in which the most competitive option (R) is free.  I do not know how long proprietary options will be around, but some version of R is likely to remain popular for the indefinite future; R is well maintained by an active and highly talented community.

Thus, as the emerging standard for statistical programming, it is likely to be a highly rewarding process (both fiscally and in terms of opportunities) to learn to use R.

R is Popular with Employers
In two recent studies, including one of over 17,000 technology professionals, R was the highest paid technical skill, with an average salary of $115,531 (read more on this here).

R is Powerful

R can handle complex and large data
Here are some useful articles related to large data and R:
3. Tips on Computing with Big Data in R

R can easily program complex simulations
This is my area of expertise.  On my blog I have over 100 releases of code in R and 170 in Stata.  The majority of these posts feature some kind of simulation.  In general I find Stata faster to work with for simple simulations.  However, if there is any complexity involved in the simulation, working in Stata is a nightmare compared with R.  I have also worked with simulations in SAS (for my master's thesis); SAS was by far the most frustrating programming experience of my life.

R can be used on High Performance Computer Clusters
High performance computer clusters (HPCCs) are large computer clusters (often at universities) which manage the processing capacity of hundreds or thousands of processors simultaneously.  These systems can crunch through simulations, data, or analysis at a rate much higher than typically achievable on an individual machine.  Thus problems which could take weeks to solve on your personal computer might take an HPCC user only hours.

The way these systems do this is by distributing individual tasks across many different processors.  With proprietary software, this generally requires a special license designed for the purpose, and I am not aware of any except possibly for SAS, or for SPSS via a third party toolkit.
Proprietary systems are inherently limited by the nature of their licensing and their price, so even if you have access to an HPCC cluster, you could find the necessary licenses cost prohibitive.
R supports multicore task distribution
In the modern world nearly all computers come with multiple cores, yet traditional programming techniques assume only one processor.  Most statistical programming languages have responded by allowing individual commands to be threaded out to different cores.  Some of these programs, such as Stata, actually make you buy a different license based on the number of cores you are using (perpetual without “maintenance”: 1 core $1,695, 2 cores $2,495, 4 cores $3,125, 6 cores $3,410, 8 cores $3,695, 12 cores $4,160, 16 cores $4,625, 24 cores $5,045, 32 cores $5,460, and 64 cores $6,445).

Expected speed gains from parallel processors (from Stata)

Let’s compare this pricing scheme to the expected increase in performance.  Going from 1 processor to 8 processors (a $2,000 increase in cost) only increases processing speed for most commands by a factor of around 3.  That is a significant increase in speed, but an even more significant increase in software cost.

This would not be so surprising or disturbing if a major portion of the advances in computer speed were not based on multicore processing.  Individual cores have gained little in speed over the last 5 years, while multicore processing has become the norm.  Therefore, the average user who pays $1,695 for a single-core Stata license will, more likely than not, be running into the same hardware limits that would have been faced half a decade ago.

For similar gains in speed through automatically distributing computational tasks in R, a spinoff called pqR (pretty quick R) has been developed.  In addition, a number of packages allow for multicore distributed tasks, though I personally have not experimented with them.
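For example, the 'parallel' package has shipped with base R since version 2.14.  A minimal sketch of distributing simulation replicates across cores (note that mclapply forks processes, so on Windows it must fall back to a single core):

```r
library(parallel)
n_cores <- max(1, detectCores() - 1)   # leave one core free for the system
# run 100 independent simulation replicates across the available cores
results <- mclapply(1:100,
                    function(i) mean(rnorm(1e4)),
                    mc.cores = if (.Platform$OS.type == "windows") 1 else n_cores)
length(results)   # one replicate mean per task
```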

My overall takeaway on multicore processing is that Stata makes it easier to see gains in speed from having more cores (if you are willing to shell out the cash); however, it is possible to accomplish similar gains using the non-mainstream flavor of R called pqR or by implementing one of the several multicore packages.

R is Flexible

R can handle everything from standard and complex statistical practices, to Bayesian modelling, to GIS map building, to building interactive web applications and interactive tests.

Here are some resources:
1. Statistical/Econometrics Models
  a. Basics
  b. Panel data
      i.    Panel Data Using R
      ii.   Econometrics Academy – Software R
      iii.  PLM Vignette 
  c. Time series
  d. Spatial Econometrics
R has Mind-Blowing Support

A wonderful feature of proprietary software is the software’s paid help line.  When I had a supported version of Stata, I could send an email to Stata Corp and usually get a response within a few hours.  This feature significantly increased my productivity and was something I felt I would find difficult to live without when I first contemplated switching to R.  However, as I became more aware of the resources available in R, I realized that the support I could find for my R questions was much faster and more thorough than that typically available through my paid Stata service.

      StackOverflow (SO)
SO is a mind-blowing resource!  This question-and-answer website is the place to find answers to all of your coding questions.  Most questions I come up with have already been asked by someone else, and I can usually find them answered on the site.  New questions, though, are even better: the average response time to the questions I ask on SO seems to be less than 10 minutes.  Plus, SO has a great reputation system in which people who ask good questions and people who give good answers are rewarded.  Every time I post a question, I am astonished again by how fast it is answered.  Looking back on the Stata help line, I realize it would be hard to return to such a sluggish system.

      R-Bloggers
I have been a reader of R-Bloggers for a couple of years now, and I am continually learning something new.  As a blog aggregator of over 450 contributing blogs, it is an excellent way to learn new skills in R or refresh old ones.  In addition, as a blogger, it is a great way to expand your audience.  R-Bloggers also stores posts from blogs that have been retired or are no longer active.  As another comparison with Stata, I started a Stata "blog aggregator" ( after getting involved and excited by R-Bloggers, and I could only find a handful of blogs on Stata.  In addition, humorously, the official Stata blog was uninterested in being aggregated, providing what seemed to me a signal that Stata Corp is not interested in supporting a general Stata blogging movement.
