# The bootstrap is an extremely powerful technique that resamples from your sample in order to estimate a standard error for an estimator. Bootstrap resampling can often yield standard-error estimates for estimators whose standard errors are otherwise extremely difficult to derive analytically (for example, via the Delta method, the predominant method for estimating standard errors of non-linear functions of estimates).

# This post will develop a simple function that takes a data source and an estimator and generates standard errors.

# First let's simulate some data.

nobs = 200

x1 = rnorm(nobs)

x2 = rnorm(nobs)/2

u = rnorm(nobs)

y = 10-x1+2*x2+u*2

# Now that we have some vectors let's join those vectors together as a data.frame

mydata = data.frame(y=y, x1=x1, x2=x2)

# A simple OLS regression will generate analytical standard errors that we can compare against.

summary(lm(y~x1+x2, data=mydata))

# We can see that the coefficient on x2 has a standard error about twice that of the coefficient on x1; this is because x2 has half the standard deviation of x1.
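
# As an aside (not part of the original routine), those standard errors can be pulled out of the summary object directly as a vector:

coef(summary(lm(y~x1+x2, data=mydata)))[, "Std. Error"]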

# As a check, suppose we would like to estimate the standard errors of our coefficients by bootstrapping instead.

# Let's define our bootstrapping function

data.boot <- function(Input.data, command, reps, hist=F) {

# Input.data is the source data that will be resampled by the bootstrapping routine

# command is the name (as a string) of the function that performs the desired operation and returns a vector

# reps is the number of bootstrap repetitions

# hist=T draws a histogram of the bootstrap draws for each estimate

# Let us first get our base estimates

original.est <- get(command)(Input.data)

# Calculate the number of estimates needed

nest = length(original.est)

# Make a holder matrix for our bootstrap results

holder <- matrix(NA, ncol = nest , nrow = reps)

# Note: the length of original.est is the number of statistics for which we will estimate standard errors.

# Calculate the number of observations to resample from

nobs <- nrow(Input.data)

# Now let's start our bootstrapping resampling loop

for (i in 1:reps) {

# Draw nobs integers uniformly from 1 to nobs

posdraws <- ceiling(runif(nobs)*nobs)
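
# An equivalent draw (just an aside, left commented out so the routine runs unchanged) could use R's built-in sampler:
# posdraws <- sample.int(nobs, nobs, replace = TRUE)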

# Draw a random sample (with replacement) from our original data set

draw <- Input.data[posdraws,]

# Now we have a new data frame called draw containing nobs resampled observations. Inspecting draw, it is easy to see that some observations are repeated.

# Apply our command to the new data set and save it as a row in holder

holder[i,] <- get(command)(draw)

}

# We have completed our bootstrap routine. Now we need only compute whatever statistics we want on the results.

# Calculate the standard deviation for each column

sds <- apply(holder,2,sd)

if (hist==T) {

par(mfrow=c(1,1))

frame()

if (nest<= 3) par(mfrow=c(nest,1))

if ((nest> 3)&(nest<= 6)) par(mfrow=c(3,2))

for (j in 1:nest) hist(holder[,j], main=paste("Estimates of ", names(original.est)[j]), xlab="")

}

print(rbind(original.est,sds))

return(list(estimates=holder, sds=sds))

}

# Now we define the first function whose standard errors we would like to bootstrap.

example1 <- function(BSdata) lm(y~x1+x2, data=BSdata)[[1]]

# The key is that the function must return a vector to the bootstrap command. The subscript [[1]] restricts the returned values to the coefficient estimates alone. It may take some manipulation to get a single vector out of an estimation result.
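
# Equivalently (an aside, not the original definition, left commented out), the coef() extractor returns the same named vector:

# example1 <- function(BSdata) coef(lm(y~x1+x2, data=BSdata))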

example1.res <- data.boot(mydata, "example1", 200, hist=T)
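
# As an aside, percentile intervals can be read straight off the saved draws; this sketch assumes the estimates matrix returned by data.boot above:

apply(example1.res$estimates, 2, quantile, probs = c(0.025, 0.975))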

# This runs the bootstrap (200 repetitions) and draws a histogram for each estimate.

# We can see that the standard deviations of our bootstrap estimates are similar to the standard errors from our linear model.

summary(lm(y~x1+x2, data=mydata))
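
# A quick side-by-side (a sketch using the objects defined above): bootstrap standard deviations against the analytical standard errors.

rbind(boot = example1.res$sds,
      lm = coef(summary(lm(y~x1+x2, data=mydata)))[, "Std. Error"])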

# Let's generate some more complex data to test our bootstrapping on

x1 = runif(nobs)

x2 = rnorm(nobs)

x3 = exp(rnorm(nobs))

x4 = log(runif(nobs))

u = rnorm(nobs)

y = -10+2*x1+2*x2+.2*x3+1.4*x4+u*3

mydata2 <- data.frame(y=y, x1=x1, x2=x2, x3=x3, x4=x4)

# Let's try something a bit more complex. This time we will estimate more coefficients and also return the R-squared.

example2 <- function(BSdata) {

# First save the results of the OLS regression into an object

a<- lm(y~x1+x2+x3+x4, data=BSdata)

# Next save the coefficients

coef <- a[[1]]

# Save the r squared

r2 <- summary(a)$r.squared

# Specify a vector to return

return(c(coef,r2=r2))

}

example2(mydata2)

example2.results <- data.boot(mydata2, "example2", 200, hist=T)

# We can see that we are able to estimate standard errors not only for regression coefficients but for any statistic generated from the sample. The extreme draws in r2 are probably due to some resampled data sets containing, by chance, very low (or very high) realizations of the error term.
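
# To look at those tails directly (an aside; r2 is the last column of the returned estimates matrix):

quantile(example2.results$estimates[, ncol(example2.results$estimates)], probs = c(.01, .5, .99))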
