Friday, December 14, 2012

Easy Monte Carlo Sampler Command

R script

# In R I am often times unsure about the easiest way to run a flexible Monte Carlo simulation.
# Ideally, I would like to be able to feed a data generating function into a command with an estimator and get back out interesting values like mean estimates and the standard deviation of estimates.
# Stata has a very easly command to use called "Simulate".
# I am sure others have created these types of command previous but I don't know of them yet so I will write my own.

MC_easy = function(dgp, estimator, obs, reps, save_data_index=0) {
  # dgp (Data Generating Proccess) is the name of a function that creates the data that will be passed to the estimator
  # estimator is the name of a function that will process the data and return a vector of results to be analyzed
  # obs is the number of observations to be created in each spawned data set
  # reps is the number of times to loop through the dgp and estimation routine
  # save data index is a vector of data repetitions that you would like saved

  # I typically start looping programming by manually going through the first loop value before constructing the loop for both debugging and demonstrative purposes.

  # This command will create a a starting sample data set.
  start_data = get(dgp)(obs)
  # This command runs the estimate values on that starting data set.
  MC_results = get(estimator)(start_data)

  # Create an empty list of save data values
  save_data = list()
  # If the first data set generated has been specified as a data set to save then save it as the first entry in the save_data list.
  if (sum(1==save_data_index)>0) save_data[[paste("1",1,sep="")]] = start_data

  for (i in 2:(reps-1)) {
    temp_data = get(dgp)(obs)
    # If this repetition is in the save_data_index then save this data to the list save_data
    if (sum(i==save_data_index)>0) save_data[[paste("i",i,sep="")]] = temp_data
    MC_results=rbind(MC_results, get(estimator)(temp_data))
  # Display the results of the estimations
  MC_results =
  print(rbind(mean=mean(MC_results), sd= sd(MC_results), var= sd(MC_results)^2))

  # If the number of items for which the data was saved is equal to zero then only return the results of the estimation.
  if (sum(save_data_index>0)==0) return(MC_results)

  # If the number of items is greater than zero also return the saved data.
  if (sum(save_data_index>0)>0)  return(list(MC_results,save_data))

#### Done with the Monte Carlo easy simulation command.

# Let's try see it in action.

# Let's define some sample data generating program
endog_OLS_data = function(nobs) {
  # Let's imagine that we are wondering if OLS is unbiased when we have a causal relationship between an unobserved variable z and the observed variables x2 and x3 and the unobserved error e.
  # x4 is an observed variable that has no relationship with any of the other variables.
  z = rnorm(nobs)
  x1 = rnorm(nobs)
  x2 = rnorm(nobs) + z
  x3 = rnorm(nobs) - z
  x4 = rnorm(nobs)

  e = rnorm(nobs)*2 + z

  y = 3 + x1 + x2 + x3 + 0*x4 + e
  data.frame(y, x1, x2, x3, x4)

sample_data = endog_OLS_data(10)
# Everything appears to be working well

# Now let's define a sample estimation
OLS_est = function(temp_data) {
  # Summary grabs important values from the lm command and coef grabs the coefficients from the summary command.
  lm_coef = coef(summary(lm(y~x1+x2+x3+x4, data=temp_data)))

  # I want to reduce lm_coef to a single vector with values for each b, each se, and each t
  lm_b = c(lm_coef[,1])
    names(lm_b) = paste("b",0:(length(lm_b)-1),sep="")
  lm_se = c(lm_coef[,2])
    names(lm_se) = paste("se",0:(length(lm_se)-1),sep="")
  lm_se2 = c(lm_coef[,2])^2
    names(lm_se2) = paste("se2_",0:(length(lm_se2)-1),sep="")

  lm_rej = c(lm_coef[,4])<.1
    names(lm_rej) = paste("rej",0:(length(lm_rej)-1),sep="")
  c(lm_b,lm_se, lm_se2,lm_rej)

# From this is it not obvious which estimates if any are biased.
# Of course our sample size is only 10 so this is no surprise.
# However, even at such a small sample size we can identify bias if we repeat the estimation many times.

# Let's try our MC_easy command.

MC_est = MC_easy(dgp="endog_OLS_data", estimator="OLS_est", obs=10, reps=500, save_data_index=1:5)
# We can see that our mean estimate on the coefficient is close to 3 which is correct.
# While b1 is close to 1 (unbiased hopefully) while b2 is too large and b3 too small and b4 is close to zero which is the true value.
# What is not possible to observe with 500 repetitions but capable of being observed with 5000 is that the standard error is a downward biased estimate of the standard deviation.
# This is a well known fact.  I think even there is a proof that there is no unbaised estimator for the standard error for the OLS estimator.
# However, se^2 is an unbiased estimator of the variance of the coefficients.
# Though the variance of se^2 is large, making it neccessary to have a large number of repetitions before this becomes apparent.
# The final term of interest is the rejection rates.  We should expect that for the coefficients b0-b3 that the rejections rates should be above random chance 10% (since there trully exists an effect) which is what they all ended up being though with b1 and b3 the rejection rate is only slightly above 20% which indicates that our power given our sample size given the effect size of x1 and x3 is pretty small.
# The final note is that though we do not have much power we are at least not overpowered when it comes to overrejecting the null when in fact the null is true.  For b4 the mean rejection rate is close to 10% which is the correct power.

# We can recreate the results table by accessing elements of the MC_est list returned from the function.

# We may also be curious about the medians and maxes.

# Or we may like to check if our data was generated correctly

No comments:

Post a Comment