Tuesday, May 28, 2013

dapply - tapply for dataframes

# Every so often I run into a situation when I find I need a function that seems to me that it should be very straightforward but because it seems so much so, it also becomes extremely difficult to find online.
 
# In this case, I am interested in a function that applies a function across and index of values and returns the results of that function as an equal length as the vector fed into the function.
 
# In the search for this specific function, I found the base function tapply(vector, index, function) which does very close to what I want.
 
# However, tapply returns a vector of length equal to the number of factors in the index value.
 
# From the tapply function there might be a very easy way to get what I want.
 
# The reason I want this function is for data analysis purposes.  Often times we might want to see how well a variable that changes on the individual level (such as student education) might affect a variable that changes on the district level (per pupil spending).
 
# However, nothing occured to me.  So I programmed up an alternative function which has a few more bells and whistles.  (Check out the comments below for efficient coding methods).
 
# So hear it is.
 
dapply <- function(v1, index, fun, data=NULL, each=F) {
  # If a data.frame or list is defined then grab v1 and index from them.
  if (is.null(data)==F) {
    # The deparse and substitute commands together
    print(deparse(substitute(v1)))
    v1 <- data[[deparse(substitute(v1))]]
    index <- data[[deparse(substitute(index))]]
  }
  # Allow index to be a pattern in which case it is replicated accross the range of v1.
  if (length(index)<length(v1)) {
    if (length(index)*2>length(v1)) print(paste("Warning: Input vector is less than the index in length but not greater than twice."))
    # Calculate the number of replications required to duplicate the length of v1 with the index.
    ndup <- length(v1)/length(index)
    # If the number is not a whole number warn user.
    if (!is.integer(ndup)) print(gettextf("Input vector is of length %i can not be evenly divided by index of length %i.  Pattern truncated at end.", length(v1), length(index)))
    # Repeat the index as neccessary
    if (!each) index <- rep(index, times=ceiling(ndup))[1:length(v1)]
    if (each)  index <- rep(index, each =ceiling(ndup))[1:length(v1)]
  }
 
  # Calculate the vector which will be returned.
  vout <- tapply(v1, index, fun)
 
  # Expand the vout to cover multiple returns 
  returner <- NULL
  for (i in 1:length(index)) returner <- c(returner, vout[index[i]==names(vout)])
  returner
}
 
# For the vector 1:100 we will find the sum of the each block of twenty numbers.
dapply(1:100, ceiling(1:100/20),sum)
 
# We could have instead specified the index from 1:5 and by specifying each=T it tells it to multiply each element by what is neccessary to equal the length of the input vector.
dapply(1:100, 1:5, sum, each=T)
 
# Without each we will instead have the sum of all of the every fifth number.
dapply(1:100, 1:5, sum)
 
# But more in line with the uses I am looking for let's first construct a sample data frame:
district <- ceiling(1:1000/100)
stud.achievement <- rnorm(1000)+district*.1
sampleframe <- data.frame(id=1:1000, stud.achievement, district)
 
# Now let's see if we can find average student achievment by district and save it as a new variable.
sampleframe$Ave.ach <- dapply(stud.achievement, district, mean, data=sampleframe)
 
# Note, this function works very similarly to the egen command in Stata.
Syntax Highlighted by Pretty R at inside-R.org

# Gethub repository.
https://github.com/EconometricsBySimulation/2013-05-28-dapply

17 comments:

  1. Have you looked into the either:
    1) ?apply
    2) The reshape/plyr pacakges?

    ReplyDelete
    Replies
    1. Thanks for writing. I have not looked into the reshape/plyr packages. However, I am very familiar with apply. I am not sure how it can be used to address the examples I have posted. Please post an example. Thanks!
      Francis

      Delete
    2. For instance, by using plyr, you can replicate the last operation as follows. I would also recommend taking a look at plyr. It is specifically made to address the tasks you describe and more.

      # make data
      district <- ceiling(1:1000/100)
      stud.achievement <- rnorm(1000)+district*.1
      sampleframe <- data.frame(id=1:1000, stud.achievement, district)

      # add a new variable
      library(plyr)
      sampleframe <- ddply(sampleframe, 'district', mutate, Ave.ach=mean(stud.achievement))
      head(sampleframe)

      Delete
    3. I believe Tal Galili is correct, plyr::ddply() is the function you need. It means "apply to data frame, output data frame". Hadley Wickham has a great paper explaining how it works (with übercool figures), and there's a few great blog posts out there on the topic.

      Delete
  2. For that problem, taking your last example, I would just do:
    means <- tapply(stud.achievement, district, mean)
    sampleframe$Ave.ach <- means[match(as.character(district), names(means))]

    ReplyDelete
  3. Hi, I'm running R 2.15.1 and your second and third examples {dapply(1:100, 1:5, sum, each=T) and dapply(1:100, 1:5, sum)} are incorrectly triggering the warning "Input vector is of length 100 can not be evenly divided by index of length 5. Pattern truncated at end."

    I believe this is because the line ndup <- length(v1)/length(index) does not cast as an integer (which you test with is.integer()). I think a better way to do this test would be: if (length(v1) %% length(index) == 0) { "message here" } using modulo arithmetic.

    Nice work, have you considered putting this up as a gist on github so others can easily access it and contribute?

    ReplyDelete
    Replies
    1. Thanks for the comment and coding correction!

      I will look into this. It is a good idea about the github. I suppose I should move all of my posts that direction for easy modification and replication.

      Delete
    2. Here is the git:
      https://github.com/fsmart/dapply

      Delete
    3. That link didn't work for me. When I searched for your github username, I found the repo at https://github.com/EconometricsBySimulation/2013-05-28-dapply.

      Thanks for moving your code to github, it makes it much easier to use and test your function.

      Delete
    4. You can also use
      if (!all.equal(ndup, as.integer(ndup)))
      instead of
      if (!is.integer(ndup))

      Delete
    5. Sorry about the git url issue. I ended up changing my identity soon after first setting it up.

      Thanks Mark for the suggestion. Perhaps we should all move to git management and start making changes that will improve the code that way. Might produce something of value though as many have said I guess the plyr package has the ability to do these kinds of tasks.

      Delete
  4. Note that something similar can be achieved with `ave` in base R, eg:

    sampleframe$Ave.ach2 = with(sampleframe, ave(stud.achievement, district, FUN=mean))

    ReplyDelete
    Replies
    1. Thanks so much Charles! I know there is always a way to do these things so much easier.

      Delete
  5. Thanks for writing this up, however you are definitely reinventing the wheel here. Plyr can definitely do what you want, and much, much more. Please don't just ask for a code example, rather go and read the extensive documentation, help, quickstarts and tutorials:

    http://lmgtfy.com/?q=plyr


    ReplyDelete
    Replies
    1. I find myself reinventing the wheel a lot :)

      It's the best way to figure out how it works the way I figure :)

      Delete
  6. Changing the last bit so that memory allocation events are minimized in the last section speeds this up 4 to 5 fold.


    # Expand the vout to cover multiple returns
    returner <- integer(length(index))
    r_names <- character(length(index))
    for (i in 1:length(index)) {
    returner[i] <- vout[index[i]==names(vout)]
    r_names[i] <- attr(vout[index[i] == names(vout)], 'names')
    }
    names(returner) <- r_names
    returner

    ReplyDelete