Thursday, April 3, 2014

Does R have too many packages?

[Comic: "The Homeless Econometrician", http://Pixton.com/ic:cyne4yvy]
The amazing growth and success of CRAN (the Comprehensive R Archive Network) is marked by the thousands of packages that have been developed and released by a highly active user base.  Yet even so, Kurt Hornik, one of the founders and primary maintainers of CRAN, asks in the Austrian Journal of Statistics (2012), "Are There Too Many Packages?"

As I understand it, the primary concerns regarding the immense proliferation of packages are: the lack of long-term maintenance of many packages, the superabundance of packages, the inconsistent quality of individual packages, the lack of hierarchical dependencies among packages, and insufficient meta-package analysis.

1 Lack of long-term maintenance of packages.  I have faced this challenge when using R packages that I believe will solve my problem, only to find that they are frequently not maintained at the same rate as the R base system.

And how could they be?  The base system is updated several times a year, while there are thousands of packages.  Updating each of those packages for minor changes in the base system seems foolish and excessive.  However, as R currently stands, failing to update these packages means that packages which previously worked stop functioning.  This is a problem I have experienced, and it is frankly very annoying.
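
For reference, a package can already declare the base versions it requires through the Depends field of its DESCRIPTION file, though this only gates installation; it does not freeze behaviour.  A minimal sketch (the package name is invented):

Package: examplepkg
Version: 0.1.0
Depends: R (>= 2.15.0)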

One solution might be to limit the number of packages to those which have a sufficient developer base to ensure long term maintenance.  However, this would likely stifle the creativity and productivity of the wide R developer base.

Another solution is to reduce the frequency of base system updates, lowering the likelihood that a package becomes outdated and needs updating.

A third option, which I believe is the most attractive, is to allow code to specify which version of R it is stable on, and for R to act, for the commands in that package, as though it were running on that previous version.  This idea is inspired by how Stata handles user-written commands: each command simply specifies the version number it was written under, and no matter what later version of Stata is used, the command should still work.
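
To make the idea concrete, here is a hypothetical sketch of the user-visible half in today's R.  R has no version-emulation mode, so this can only warn rather than emulate; the function name and target version are invented:

# Hypothetical: warn when the running R differs from the version the
# code was developed against. True emulation would need core support.
warn_if_newer <- function(target = "2.15.0") {
  if (getRversion() > target) {
    warning("This code was written for R ", target,
            " but is running under R ", getRversion(),
            "; behaviour may have changed.")
  }
}
warn_if_newer()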

I understand that such an implementation would require additional work from the R core team for each subsequent update.  However, the investment may be worth it in the long run, since it would decrease the maintenance burden that R base updates impose on package authors.

2 The superabundance of R packages.  The concern is that there are so many packages that users might find it difficult to wade through them to find the right one.  I don't really see this as a problem.  If someone wanted to learn all R packages, that task would of course be nearly impossible.  However, like most people, I learn to use new functions within packages to solve specific problems.  I don't really care how many packages are out there.  All I care about is that when I ask a question on Google or StackOverflow about how to do x or y, someone can point me to the package and command combination necessary to accomplish the task.

3 The inconsistent quality of individual packages.  It is not always clear whether user-written packages are really doing what they claim to do.  I personally am constantly on the lookout for checks to make sure my code is doing what I think it is doing, yet I still consistently find myself making small errors which only show up through painstaking experimentation and debugging.

CRAN has some automated procedures in which packages are tested to ensure that all of their functions run without errors under normal circumstances.  However, as far as I know, there are no automated tests to ensure the commands are not silently doing the wrong thing.  Those kinds of error controls are left entirely to authors and users.  This concern comes to mind because a friend of mine recently ran two different Bayesian estimation packages which were supposed to produce identical results, yet each returned distinctly different results, with one set of estimates significant and the other not.  If he had not thought to try two different packages, he would never have suspected that either package might contain errors.
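
As a minimal sketch of that cross-checking habit, here is the same least-squares fit computed by two independent routes; the data are simulated purely for illustration:

# Two independent implementations of the same estimate should agree
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

# Route 1: R's built-in least squares
fit1 <- coef(lm(y ~ x))

# Route 2: the normal equations, coded by hand
X <- cbind(1, x)
fit2 <- drop(solve(t(X) %*% X, t(X) %*% y))

# If two routines that should agree do not, investigate both
stopifnot(isTRUE(all.equal(unname(fit1), unname(fit2))))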

A solution to inconsistent package quality may be a multi-tiered release structure in which packages are first released in "beta" form and must pass review by an independent group, which checks functionality and writes up a report, before attaining "full" release status.  Such an independent review structure might be accomplished by developing an open-access R journal specifically geared towards the review, release, and revision of R packages.

4 The lack of hierarchical dependencies.  This is a major point in Kurt Hornik's paper.  He looks at package dependencies and finds that the majority of packages have no dependencies on other packages.  So while there are many packages out there, most are not building on the work of others, which produces the unfortunate situation in which many package developers seem to be recreating each other's work.  I am not really sure whether anything can be done about this, or whether it really is an issue.

It does not bother me that many users write similar or duplicate code, because doing so helps them better understand the R system, their problem, and their solution.  There is, however, the issue that the more times a problem is coded, the more likely someone is to code an error.  This brings us back to point 3: errors must be rigorously pursued and ruthlessly exterminated through an independent error-detection system.

5 Insufficient meta-package analysis.  Another point Kurt Hornik raises is that while there are a lot of R packages out there, there is not a lot of information about how those packages are being used.  To further this end, it might be useful to build into future releases of R an option to report usage statistics on which packages and functions are used in combination with which others.  Package developers might find such information useful when deciding which functions to update.
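
No such facility exists in R today, but a crude opt-in version can be sketched entirely in user code; the log-file location and the use of the .Last hook are just one possible arrangement:

# Crude, opt-in usage logging: record which namespaces were loaded
# when the session ends. A real mechanism would need user consent,
# aggregation, and privacy safeguards.
.Last <- function() {
  line <- paste(Sys.time(), paste(loadedNamespaces(), collapse = ","))
  cat(line, "\n", file = "~/.R_usage.log", append = TRUE, sep = "")
}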

Conclusion
Overall, it is impossible not to recognize CRAN as a huge success.  CRAN has been extremely effective at providing a repository for the distribution of many R packages addressing a myriad of user demands.  In a way, this post and the article that inspired it are only presenting the problems associated with success.  Yet, given that success, how should we move CRAN forward?  This post presents some possible solutions.

Finally, I would like to say thank you to all of the fantastic R developers who have released so many packages.  I do not claim credit for any of the thoughts expressed in this post.  As a newcomer to R, I am not personally aware of the many thoughtful dialogues that must already have transpired around these issues, and I am sure more thoughtful and considerate minds than mine have already proposed what they believe are the best solutions to the problems raised here.

19 comments:

  1. One solution to problem 1 would be to have a two-layered CRAN. In layer 1 there would be packages that are properly maintained by their authors (we would need a good protocol for what "properly" means), and in layer 2 all other packages. That way users can check whether the well-maintained packages have the functionality they are after, and if they don't, they can turn to the wider set of packages in layer 2. This also gives authors an incentive to maintain their code, since doing so would get their package listed in layer 1.

  2. An excellent summary of the five points. I offer two thoughts:
    1. Might CRAN either require or urge package submitters to list related packages? Something like how patent applicants are required to list prior art: earlier patents that are related but sufficiently different.
    2. As to meta-package analysis, is it possible for someone with access to StackOverflow to compile the number of times packages are mentioned over some period of time? That could be a crude indicator of what is being used, and might lead to extra efforts to test or maintain those packages.

  3. Is 1 really a problem? You can have as many versions of R installed as you want. If you have a package that only works on R 2.15 that you really need, then keep R 2.15. But this is open source: you can copy the function over and update it yourself. The open-source point applies to 2 and 3 as well; if an error has been made, it is in the code, and the code is open to everyone.

  4. How about a layer 0 for packages whose functions have been tested against textbook data and shown to reproduce the published results (under the assumption that the book contents are right...)? For example (without wanting to advertise...), in my 'propagate' package I tested all results against the "Guide to the Expression of Uncertainty in Measurement" (GUM 2008), which might be viewed as a sort of "external review".

    Replies
    1. This is great and exactly along the lines of what I was thinking.

  5. Maybe the reason people don't use dependencies a lot is that they fear having to keep track of changes in the other packages. It's easier to copy the code (which is, after all, open-source) and then it will only change if you want to change it.

    As for the number of packages, I'd argue we have too *few* of them, not too many. I'd like to see more, but smaller, packages. Many packages contain a very wide suite of functionality. It would be nice to have a new class of "micro" packages that try to do one thing really rather well. Then, if a solution is seen to be good, it could be folded into the base system. There are many opportunities for such work. For example, log axes in base graphics are non-standard compared with publication practice, and I bet dozens of people have written their own code to solve the problem, in either published or private packages. If micro packages existed, there could be a competition to code a good alternative log axis, and whichever solution was deemed best by users and the R core team could make its way into base graphics.

    I suppose an alternative to micro packages would be an alternative to library() that permits importing just one function. This scheme is used to great effect in Python. Heck, in Python you can even rename the function (from library import NewLogAxisVersion42 as logaxis).

    Replies
    1. I think the idea of micro packages is great. Great idea about having competitions over the top choices for specific problems as well.

      As for your thoughts on reading in individual functions: it is a good idea that can probably be implemented immediately. For example:

      # Bind only the ggplot2 functions we need, without library(ggplot2)
      aes <- ggplot2::aes
      ggplot <- ggplot2::ggplot
      geom_histogram <- ggplot2::geom_histogram

      # 'mydata' stands in for the user's data; a histogram maps only x
      ggplot(as.data.frame(mydata), aes(x = b)) + geom_histogram()

      However, I think most people will still opt for loading the entire library, since the point of loading individual functions is mainly to reduce demands on memory and namespace clutter, and most R users are not going to be concerned about the cost of attaching a few more packages.

  6. Two layers on CRAN is a good idea, together with a quantitative metric of package excellence (a threshold would define membership in layer 1) that includes things like the number of RUnit tests per function (these should be integrated into the online documentation).

    I'm not a fan of encouraging package dependencies; R packages just don't encourage library-like function interfaces. Better would be a second type of "library" package devoted to functions meant to be called by other functions rather than by users. This would encourage the right sort of unit tests in both the traditional "user" packages and the new library ones. It would also make for more natural function-naming conventions.

    Replies
    1. Interesting points about the non-user interface functions. I am sure you have some reference you are thinking of. Perhaps you could give an example to help flesh out your idea.

      2. One common scenario is where the underlying functionality you want from a package is an unexported function, so it must be called with the ::: operator or via some interface function (often a generic method), which involves additional overhead. If we spent more time documenting and generalising the functions we don't think the user needs to see, I think we'd end up with a powerful set of libraries partitioned not along the lines of end-user applications but into more abstract categories of calculations and data manipulations. There are a few benefits here: (1) library packages would legitimize longer, more descriptive function names that can be easily searched by package developers, (2) they would reduce redundancy and package code length, and (3) they would encourage more functional programming patterns like closures.
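
      For a concrete (if minimal) illustration of the exported/unexported split, compare a namespace's full contents with its export list; the choice of the stats package here is arbitrary:

      # Everything defined inside the namespace, exported or not
      head(ls(getNamespace("stats")))
      # Only the exported subset, i.e. what is reachable without :::
      head(getNamespaceExports("stats"))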

  7. I don't see the problem with point 1. If a package only works for version 2.15 and I'm on version 3.x.x, then I simply install version 2.15 again to use the package. This is far simpler than trying to update the function myself, and previous versions of R are still available.

    Replies
    1. This comment has been removed by the author.

      2. This is a good point, though I can imagine some package x which only works up to, say, R 2.5 and some package y which only works after, say, 2.8. In that case the two packages have no overlapping version, and someone would have to change either x or y. Maybe an unusual case.

      That said, I personally will probably never load a previous version of R to run a package. That seems messy and destined to create frustration in the long run. At that point I would rather write the function myself.

  8. We do have to remember that the majority of R package developers are NOT software developers, and do not want to be. They are simply taking some code they have written and making it available to the world.
    Insisting on lots of checks in the code, and things like that, is a job for software developers, not for the typical statistician who releases their code to the world.

    Replies
    1. I don't think any of the suggestions offered thus far would make inordinate demands of package developers. Most of them seem to favor some kind of two-level system in which packages can either be released generally or receive some kind of endorsement of validity. Is there something wrong with that?

      Please, I am really sorry if I said anything offensive. I do not mean to imply that R package developers are doing anything less than remarkable.

  9. I would argue (like Hornik) for keeping exactly the same system in place, in defence of as few constraints as possible on core R package development and diffusion. More restrictive CRANs can be built on top of the existing one, if needed, but the baseline needs maximal variety.

    The R package world is a highly diverse ecosystem, and should stay so, at the price of anarchy outside its core. It's from that kind of messy primeval soup that you get high-quality outputs and a community of expert users to find them.

    For instance, I would offer that it is precisely because there are so many concurrent data-manipulation options in R that the best ones, like dplyr, make so much noise in the user community when they get published.

    None of that is incompatible with little niches of "stable / meritocratic / profitable R development" forming everywhere on the surface of the anarchic, unregulated layer. This is pretty much what happened with Linux and it seems to be working for R too.

    Also, you want amateur/freelance developers to feel as welcome as possible, because they're basically working for free improving your professional software all day long by producing all sorts of funny things, some of which are actually quite brilliant :)

    tl;dr — leave the gatekeeping to other institutions, don't artificially sparsify the package forest

  10. To help people find packages, allow package creators to attach keywords or phrases to their packages, such as "garch", "vector autoregression", "wavelets", etc., and allow people to search CRAN packages by keyword. This is in the spirit of the task views, but would not require any manual labor once created.
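
    As a rough sketch, something similar can already be improvised by grepping CRAN's metadata; recent versions of R ship tools::CRAN_package_db() (this needs internet access, and the Description field stands in for true author-supplied keywords):

    # Find packages whose Description mentions a keyword
    db <- tools::CRAN_package_db()
    hits <- grepl("wavelet", db$Description, ignore.case = TRUE)
    db$Package[hits]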
