Thursday, September 20, 2012

Stata is to Accounting as R is to Tetris


Both Stata and R handle many of the same data computation needs.

However, researchers must subset data within them very differently.

In Stata, subsetting can be extremely easy.

If you want to restrict your data, you can simply append an if qualifier to most commands.
 
  * Stata code
  clear
  set obs 100
  gen y = rnormal()
  gen x1 = rnormal()
  gen x2 = rnormal()
  gen s = rbinomial(1,.5)
  reg y x1 x2 if s==1

Thus the OLS regression of y on x1 and x2 will only use the observations for which s==1.  In R this operation can be a little trickier. Imagine you have a data set called mydata which has four variables: y, x1, x2, and s.  The easiest way to restrict the data would probably be to create a new subset of the data set.

  # R code
  # Create your data set
  mydata = data.frame(y=rnorm(100), x1 =rnorm(100), x2 =rnorm(100), s = rbinom(100,1,.5))
  # Create a subset of your data with mydata[mydata$s==1,] and pass it to lm
  lm(y~x1+x2, data=mydata[mydata$s==1,])

This is the same operation as the Stata regression above.
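
For what it's worth, bracket subsetting is not the only way to restrict the data in base R. The sketch below compares it with subset() and with the subset argument of lm(), both of which are part of base R. The seed and the object names fit_bracket, fit_subset and fit_arg are made up for this illustration.

```r
# R code
set.seed(1)  # arbitrary seed, only so the example is reproducible
mydata = data.frame(y=rnorm(100), x1=rnorm(100), x2=rnorm(100), s=rbinom(100,1,.5))

# 1. Bracket subsetting
fit_bracket = lm(y~x1+x2, data=mydata[mydata$s==1,])
# 2. subset() builds the restricted data frame for you
fit_subset = lm(y~x1+x2, data=subset(mydata, s==1))
# 3. lm() has its own subset argument, probably the closest analogue to Stata's "if"
fit_arg = lm(y~x1+x2, data=mydata, subset = s==1)

# All three should give the same coefficients
all.equal(coef(fit_bracket), coef(fit_subset))
all.equal(coef(fit_bracket), coef(fit_arg))
```

Which form to use is mostly a matter of taste. The subset argument keeps the restriction next to the model, much like the Stata version.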

Those of you with less experience in R are probably wondering how using brackets accomplished the same task.

In R, as in Stata, you use brackets to indicate subscripts.

For instance, "letters" is a built-in constant vector in R that contains all of the lowercase letters from a to z.

Thus:

  letters[1] # Displays "a"
  # Vectors can also be subscripted by vectors (with repetition)
  v = c(1,2,3,2,1,5)
  letters[v]
  # Displays "a" "b" "c" "b" "a" "e"
 
  # Vectors can also be subset using logical vectors.
  vv = rep(c(TRUE,FALSE),13)
  # Creates a vector 26 elements long alternating between TRUE and FALSE
  letters[vv]
  # Will display every other letter starting with a.
 
  # This brings us back to how we subsetted a dataframe.
 
  # Let's make a new data frame called mysample
  mysample = data.frame(a = letters, b = 1:26, c = rnorm(26))
 
  # We can subset the data frame by using two subscripts now
  mysample[4,2] # Displays 4
  mysample[3,]  # Displays an entire row of the data frame.
  mysample[vv,] # Will display every other row
  # Replacing a subset uses the same notation as subsetting
  mysample[vv,1] <- "z" # Will replace every other letter with "z"
 
  # Thus mydata[mydata$s==1,] is telling R to use any row in which variable s of data frame mydata is equal to 1.
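
To make that last point concrete, here is a small sketch showing that the condition inside the brackets is nothing more than a logical vector like vv. The data frame small, its values, and the name keep are invented for the illustration.

```r
# R code
small = data.frame(y = c(10, 20, 30, 40), s = c(1, 0, 1, 0))

# The comparison is evaluated first and yields one TRUE/FALSE per row
keep = small$s == 1
keep            # TRUE FALSE TRUE FALSE
small[keep,]    # rows 1 and 3

# Conditions can be combined with & (and) and | (or)
small[small$s==1 & small$y>15,]   # only row 3

# Caution: if s contained an NA, small$s==1 would be NA for that row
# and the bracket would return a row of NAs. Wrapping the condition
# in which() drops such rows.
small[which(small$s==1),]
```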
 
At this point you are probably thinking that R is overly complicated and that Stata handles data much better.  This is not true.

Stata handles data in a manner similar to that of an accountant.  If you want your accountant to add different values within a row, there is no problem.  You can even use subscripts to move values from one row to another.  R, on the other hand, takes data, transforms it, and combines it into new forms, often much more easily than Stata but with more complex notation.
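
As a rough illustration of that contrast, the sketch below does both kinds of work in R. The data frame panel, its columns, and the names means and combined are all invented for the example. The first half mimics the row-by-row bookkeeping described above, and the second half collapses the data and merges the result back on.

```r
# R code
panel = data.frame(id   = c(1, 1, 2, 2),
                   year = c(2011, 2012, 2011, 2012),
                   x1   = c(1, 2, 3, 4),
                   x2   = c(10, 20, 30, 40))

# "Accounting" style: add values within each row
panel$total = panel$x1 + panel$x2
# Move a value down from the previous row (similar in spirit to x1[_n-1] in Stata)
panel$x1_lag = c(NA, head(panel$x1, -1))

# "Tetris" style: transform the data into a new shape
# Collapse to one row per id holding the mean of total
means = aggregate(total ~ id, data=panel, FUN=mean)
names(means)[2] = "total_mean"
# Combine the collapsed data with the original rows
combined = merge(panel, means, by="id")
combined
```

Note that the lag here ignores the id groups, so the first row of id 2 picks up the last value of id 1. It is only meant to show the notation.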

1 comment:

  1. I never knew even a mid range accounting software could be so complicated. I mean, it's just a stark contrast to how easy it looks from a general user's standpoint.
