Sunday, September 23, 2012

Program Your own Bootstrap Command #R#

# A similar bootstrap command is written in Stata: see http://www.econometricsbysimulation.com/2012/07/write-your-own-bootstrap-command.html

# The bootstrap command is an extremely powerful command that resamples from you sampling distribution in order to estimate a standard error for an estimator.  Often, using the bootstrap resampling technique can yeild estimates of standard errors on estimators that are otherwise extremely difficult to estimate (using the Delta method, the predominant method for estimating standard errors for non-linear functions).

# This post will develop a simple function that takes a data source and an estimator and generates standard errors.

# First let's simulate some data.

nobs = 200

x1 = rnorm(nobs)
x2 = rnorm(nobs)/2
u = rnorm(nobs)

y = 10-x1+2*x2+u*2

# Now that we have some vectors let's join those vectors together as a data.frame
mydata = data.frame(y=y, x1=x1, x2=x2)

# A simple OLS regression will generate good standard errors.

summary(lm(y~x1+x2, data=mydata))
# We can see that x2 has a standard deviation of about twice that of x1, this is due to x2 having half the standard deviation of x1.

# As a check, let's say we would like to estimate the standard deviation of our coefficients using bootstrapping instead.

# Let's define our bootstrapping function
data.boot <- function(Input.data, command, reps, est=F, hist=F) {
  # data is the source data that will be passed into the boot strapping command
  # command is the function that will do the operation that we desire and return a vector
  # reps is the number of repetitions in the simulation

  # Let us first get our base estimates
  original.est <- get(command)(Input.data)

  # Calculate the number of estimates needed
  nest = length(original.est)

  # Make a holder matrix for our bootstrap results
  holder <- matrix(NA, ncol = nest , nrow = reps)
    # Note, the length of our original.est is the number of coefficients we will estimate standard errors for.

  # Calculate the number of observations to resample from
  nobs <- nrow(Input.data)

  # Now let's start our bootstrapping resampling loop
  for (i in 1:reps) {
    # Draw nobs uniform integers draws from 1 to nobs
    posdraws <- ceiling(runif(nobs)*nobs)
 
    # Sample a random sample from our original data set
    draw <- Input.data[posdraws,]
      # Now we should have a new data frame called draw which has nobs draws.  Observing draw it should be easy to see there there are some observations which are repeated.
   
    # Apply our command to the new data set and save it as a row in holder
    holder[i,] <- get(command)(draw)
  }

  # We have completed our bootstrap rountine.  Now we need only perform whatever statistics we want on the results.

  # Calculate the standard deviation for each column
  sds <- apply(holder,2,sd)

  if (hist==T) {
    mfrow=c(1,1)
    frame()
    if (nest<= 3) par(mfrow=c(nest,1))
    if ((nest> 3)&(nest<= 6)) par(mfrow=c(3,2))
    for (j in 1:nest) hist(holder[,j], main=paste("Estimates of ", names(original.est)[j]), xlab="")
  }
  print(rbind(original.est,sds))
  return(list(estimates=holder, sds=sds))
}

# We will define our first function that we would like to bootstrap the standard errors
example1 <- function(BSdata) lm(y~x1+x2, data=BSdata)[[1]]
  # The key is that we need the function to return a vector to the boostrap command.  Thus the subscript [[1]] restricts the returned values to only be the estimates.  It my take some manipulation to get a single vector out of an estimate.

example1.res <- data.boot(mydata, "example1", 200, hist=T)
  # This will run a bootstrap of the estimates (200 times) and make a histogram



# We can see that the standard deviation of our bootstrap estimates are similar to those of our linear model.
summary(lm(y~x1+x2, data=mydata))

# Let's generate some more complex data to test our bootstrapping on
x1 = runif(nobs)
x2 = rnorm(nobs)
x3 = exp(rnorm(nobs))
x4 = log(runif(nobs))
u = rnorm(nobs)

y = -10+2*x1+2*x2+.2*x3+1.4*x4+u*3

mydata2 <- data.frame(y=y, x1=x1, x2=x2, x3=x3, x4=x4)

# Let's try something a bit more complex, first we will be estimating more coefficients
example2 <- function(BSdata) {
  # First save the results of the OLS regression into a
  a<- lm(y~x1+x2+x3+x4, data=BSdata)
  # Next save the coefficients
  coef <- a[[1]]
  # Save the r squared
  r2 <- summary(a)$r.squared

  # Specify a vector to return
  return(c(coef,r2=r2))
}
example2(mydata2)

example2.results <- data.boot(mydata, "example2", 200, hist=T)

We can see that we are able to estimate not only standard coefficients but also the standard deviation of any estimator generated from sampling.  The extreme draws in r2 is probably due to some of the sampling distributions having very low values (or high values) of the error term by chance in those extreme draws.

No comments:

Post a Comment