Wednesday, December 19, 2012

User written functions in R

R Script

# Writing your own functions can make your programs much more efficient, decreasing the lines of code required to accomplish the desired results while also reducing the chance of errors introduced by repetitive code.

# It is easy to pass whole blocks of arguments from one function to the next with the ... syntax.  This is a catch-all syntax that can take any number of arguments.  Thus, if you do not like the default choices of some functions, such as paste's default of joining pieces with " ", a simple wrapper like the following is one I often include:

# Concise paste function that joins elements automatically together.
p = function(...) paste(..., sep="")
  x=21
  p("For example x=",x)
  # This function is particularly useful when using the get or assign commands since they take text identifiers of object names.
  a21 = 230
  get(p("a",x))

# print is a useful function for returning feedback to the user.
# Unfortunately it only displays a single object.  By combining it with the new paste function I can easily make it take multiple arguments.
# Concise paste and print function
pp = function(...) print(p(...))
  # Print displays text.
  for (i in seq(1,17,3)) pp("i is equal to ",i)

# Round to nearest x
# Functions can also take any number of specific arguments.
# These arguments are matched either by the order in which they are supplied or by explicit names.
# If an argument is left blank then the default for that argument is used when available or an error message is returned.
round2 = function(num, x=1) x*round(num/x)
  # This rounding function is programmed to behave like Stata's round() function, which is more flexible than the built-in round command in R.
  # In R you specify the number of digits to round to.
  # In Stata however you specify a number to round.
  # Thus in Stata round(z,.1) is the same in R as round(z,1).
  # However, the Stata command is more general than the R one since Stata for instance could round to the nearest .25 while the R command would need to be manipulated to accomplish the same goal.
  # I have therefore written round2 which rounds instead to the nearest x.
  round(1.123, 2)
  round2(1.123, .01)
  # yield the same result.  Yet round2 also works with values like those below.
  # Argument order is not necessary; the following produces identical results.
  round2(x=.01,num=1.123)
  round2(123, 20)
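  # As an added illustration of the flexibility mentioned above, round2 also rounds to the nearest .25:
  round2(1.13, .25)
  # Returns 1.25, since 1.13/.25 = 4.52, which rounds to 5, and 5*.25 = 1.25.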
  # round2 has a default x of 1, so it can be used quickly to round to the nearest whole number.
  round2(23.1)
  # The original round has the same default.

# Modular arithmetic is often quite helpful in generating data.
# The following function reduces a number to its smallest non-negative value under the given modulus.
mod = function(num, modulo) num - floor(num/modulo)*modulo
  # This function is programmed to behave like Stata's mod() function.
  mod(15,12)
  # A modulus of 12 is the system used for hours on a clock.
  # Thus mod(15,12) is the same as
  mod(3,12)
  # Or even:
  mod(-9,12)
  # There is a built in operator that does the same thing as this function.
  -9 %% 12

# Check for duplicate adjacent rows.  The first row of a duplicate set is not flagged but subsequent rows are.
radjdup = function(df) c(FALSE, apply(df[2:nrow(df),] == df[1:(nrow(df)-1),], 1, sum) == ncol(df))

# Half of the data is duplicated in terms of the (x, y) columns: each (x, y) pair appears exactly twice.
  example.data = data.frame(id=1:100, x=mod(1:100, 5), y=rep(1:10,each=10))

# The data needs to be sorted for radjdup to work, so that identical (x, y) pairs end up in adjacent rows.
  example.order = example.data[order(example.data$x, example.data$y),]

# We expect to see half the observations flagged as duplicates, since the first instance of each duplicate pair is not flagged.
  cbind(example.order, duplicate=radjdup(example.order[,2:3]))
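
# As a side note (an addition to the original post), R's built-in duplicated() function flags repeated rows without requiring the data to be sorted first:
  cbind(example.data, duplicate=duplicated(example.data[,2:3]))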

Friday, December 14, 2012

Easy Monte Carlo Sampler Command

R script

# In R I am often unsure about the easiest way to run a flexible Monte Carlo simulation.
# Ideally, I would like to be able to feed a data generating function and an estimator into a command and get back interesting values such as the mean and standard deviation of the estimates.
# Stata has a very easy command for this called "simulate".
# I am sure others have created these kinds of commands previously, but I do not know of them yet, so I will write my own.

MC_easy = function(dgp, estimator, obs, reps, save_data_index=0) {
  # dgp (Data Generating Process) is the name of a function that creates the data that will be passed to the estimator
  # estimator is the name of a function that will process the data and return a vector of results to be analyzed
  # obs is the number of observations to be created in each spawned data set
  # reps is the number of times to loop through the dgp and estimation routine
  # save_data_index is a vector of the repetition numbers for which you would like the generated data saved

  # When programming a loop I typically step through the first iteration manually before constructing the loop, for both debugging and demonstration purposes.

  # This command creates a starting sample data set.
  start_data = get(dgp)(obs)
  # This command runs the estimator on that starting data set.
  MC_results = get(estimator)(start_data)

  # Create an empty list of save data values
  save_data = list()
  # If the first data set generated has been specified as a data set to save then save it as the first entry in the save_data list.
  if (sum(1==save_data_index)>0) save_data[[paste("i",1,sep="")]] = start_data

  # Loop over the remaining repetitions (the first was done above), giving reps repetitions in total.
  for (i in 2:reps) {
    temp_data = get(dgp)(obs)
 
    # If this repetition is in the save_data_index then save this data to the list save_data
    if (sum(i==save_data_index)>0) save_data[[paste("i",i,sep="")]] = temp_data
 
    MC_results=rbind(MC_results, get(estimator)(temp_data))
  }
  # Display the results of the estimations
  MC_results = as.data.frame(MC_results)
  print(rbind(mean=apply(MC_results, 2, mean), sd=apply(MC_results, 2, sd), var=apply(MC_results, 2, sd)^2))

  # If the number of items for which the data was saved is equal to zero then only return the results of the estimation.
  if (sum(save_data_index>0)==0) return(MC_results)

  # If the number of items is greater than zero also return the saved data.
  if (sum(save_data_index>0)>0)  return(list(MC_results,save_data))
}

#### Done with the Monte Carlo easy simulation command.

# Let's see it in action.


# Let's define a sample data generating program
endog_OLS_data = function(nobs) {
  # Let's imagine we are wondering whether OLS is unbiased when an unobserved variable z causally affects the observed variables x2 and x3 as well as the unobserved error e.
  # x4 is an observed variable that has no relationship with any of the other variables.
  z = rnorm(nobs)
  x1 = rnorm(nobs)
  x2 = rnorm(nobs) + z
  x3 = rnorm(nobs) - z
  x4 = rnorm(nobs)

  e = rnorm(nobs)*2 + z

  y = 3 + x1 + x2 + x3 + 0*x4 + e
  data.frame(y, x1, x2, x3, x4)
}

sample_data = endog_OLS_data(10)
sample_data
# Everything appears to be working well

# Now let's define a sample estimation
OLS_est = function(temp_data) {
  # Summary grabs important values from the lm command and coef grabs the coefficients from the summary command.
  lm_coef = coef(summary(lm(y~x1+x2+x3+x4, data=temp_data)))

  # I want to reduce lm_coef to a single vector with values for each b, each se, and each t
  lm_b = c(lm_coef[,1])
    names(lm_b) = paste("b",0:(length(lm_b)-1),sep="")
  lm_se = c(lm_coef[,2])
    names(lm_se) = paste("se",0:(length(lm_se)-1),sep="")
  lm_se2 = c(lm_coef[,2])^2
    names(lm_se2) = paste("se2_",0:(length(lm_se2)-1),sep="")

  lm_rej = c(lm_coef[,4])<.1
    names(lm_rej) = paste("rej",0:(length(lm_rej)-1),sep="")
  c(lm_b,lm_se, lm_se2,lm_rej)
}
OLS_est(sample_data)

# From this it is not obvious which estimates, if any, are biased.
# Of course our sample size is only 10 so this is no surprise.
# However, even at such a small sample size we can identify bias if we repeat the estimation many times.

# Let's try our MC_easy command.

MC_est = MC_easy(dgp="endog_OLS_data", estimator="OLS_est", obs=10, reps=500, save_data_index=1:5)
# We can see that our mean estimate of the constant is close to 3, which is correct.
# b1 is close to 1 (hopefully unbiased), while b2 is too large, b3 is too small, and b4 is close to zero, which is its true value.
# What cannot be observed with 500 repetitions but can be observed with 5000 is that the standard error is a downward-biased estimate of the standard deviation of the coefficient estimates.
# This is a well-known fact.  I believe there is even a proof that no unbiased estimator of the standard error exists for the OLS estimator.
# However, se^2 is an unbiased estimator of the variance of the coefficients.
# The variance of se^2 is large, though, making it necessary to run a large number of repetitions before this becomes apparent.
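# We can check these claims directly from the returned results (an added check, not in the original post).
# Compare the average estimated standard error of b1 with the standard deviation of the b1 estimates, and the average se^2 with their variance.
mean(MC_est[[1]]$se1)    # average estimated standard error for b1
sd(MC_est[[1]]$b1)       # standard deviation of the b1 estimates across repetitions
mean(MC_est[[1]]$se2_1)  # average se^2 for b1, which should be close to ...
var(MC_est[[1]]$b1)      # ... the variance of the b1 estimates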
# The final terms of interest are the rejection rates.  For the coefficients b0-b3 we should expect rejection rates above the 10% chance level, since a true effect exists, and that is what we find.  For b1 and b3, however, the rejection rate is only slightly above 20%, which indicates that our power, given our sample size and the effect sizes of x1 and x3, is pretty small.
# A final note: though we do not have much power, we at least do not over-reject the null when the null is in fact true.  For b4 the mean rejection rate is close to 10%, which is the correct rejection rate at that significance level.

# We can recreate the results table by accessing elements of the MC_est list returned from the function.
rbind(mean=apply(MC_est[[1]], 2, mean), sd=apply(MC_est[[1]], 2, sd), var=apply(MC_est[[1]], 2, sd)^2)

# We may also be curious about the medians and maxes.
summary(MC_est[[1]])

# Or we may like to check if our data was generated correctly
MC_est[[2]]

Tuesday, December 11, 2012

Musing on the importance of Standard Errors

Standard errors are really just as important as coefficients when estimating a relationship because they provide a critical component of inference.  Without standard errors, coefficients are largely uninteresting.

What does it matter that the coefficient from an estimation is, say, 7.3 if we do not know the standard error?  If the standard error is, say, 1, then 7.3 is quite different from 0, if that is what is important.  However, if the standard error is instead 100, then a coefficient of 7.3 is most likely meaningless random noise.
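To make this concrete, here is a quick back-of-the-envelope check in R (an added illustration): with a standard error of 1 the z-statistic is 7.3 and the two-sided p-value is essentially zero, while with a standard error of 100 the z-statistic is 0.073 and the p-value is about 0.94.

2*pnorm(-abs(7.3/1))    # p-value with a standard error of 1: essentially zero
2*pnorm(-abs(7.3/100))  # p-value with a standard error of 100: roughly 0.94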

Standard errors are also interesting in their own right: even without the coefficient estimates, they provide a point value that tells us how much we can expect our estimates to vary.

Likewise, standard errors are often much more difficult to calculate than coefficients and even more sensitive to correct specification.  Biased standard errors can make an outcome look less likely than it actually is (leading us to over-reject the null) or more likely than it actually is (leading us to under-reject the null).

A good example is failing to cluster standard errors when appropriate.  This might be a problem if, say, you had data from twenty different summer camps.  Evaluating camper outcomes without clustering by camp implicitly assumes that the outcomes of campers are independent of one another.  Clustering at the camp level, by contrast, allows each camp to have a common shared shock (error) that season.
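As a rough sketch of what this looks like in R (an added example, not part of the original post), suppose we had a hypothetical data frame camps with variables outcome, treatment, and camp; the sandwich and lmtest packages offer one common way to compute cluster-robust standard errors:

library(sandwich)
library(lmtest)

# camps, outcome, treatment, and camp are hypothetical names used only for illustration.
fit = lm(outcome ~ treatment, data = camps)

# Conventional standard errors treat every camper as independent.
coeftest(fit)

# Cluster-robust standard errors allow campers in the same camp to share a common shock.
coeftest(fit, vcov. = vcovCL(fit, cluster = ~ camp))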

Monday, December 10, 2012

Checking for differences in estimates by group


Stata do file

* Imagine that you have two groups in your sample and you are wondering if both groups respond to the explanatory variables in the same manner.

clear
set obs 1000

gen group=rbinomial(1,.5)

gen x = rnormal()

gen y = 2 + 3*x + rnormal()*2 if group==0
replace y = 1 + 0*x + rnormal()*2 if group==1

* There are several ways to test whether the two sets of estimates are the same.

* One method is using interaction dummies.

gen x_group = x*group

* The technique is to include an interaction term in the model for each coefficient whose equality across groups you would like to test.

* Group counts as an interaction term between the constant and the group indicator.
* x_group is the interaction between the x variable and the group variable.

reg y group x_group x

* In order to interpret the coefficients in the above regression.

* First set group equal to zero.

* In that case all that is in the regression is: y = 1*b0 + x*b1, which is the estimating equation for group 0.

* We can see the constant estimate is close to 2 and the coefficient on x is close to 3 which are our values.

* Now the tricky thing is to interpret the remaining coefficients.

* Because the group dummy is effectively an interaction with the constant, the constant for the group=1 set is estimated by adding _b[group] to _b[_cons].

di "Group 2 constant estimate is " _b[group]+ _b[_cons] " which we can see is close to 1"

* We do a similar procedure with the slope.

di "Group 2 slope estimate is " _b[x]+ _b[x_group] " which we can see is close to 0"

* The nice thing about this formulation is that the coefficients are already included in the estimation.

* Thus we can easily test the hypothesis that both groups are the same.

test x_group=group=0

* If both groups were the same then they would have the same mean (same constants) and respond the same to the treatment x.

* An alternative way of testing each coefficient individually is:

reg y x if group == 0

local b_cong0 = _b[_cons]
local b_xg0 = _b[x]
local se_cong0 = _se[_cons]
local se_xg0 = _se[x]

reg y x if group == 1

* The t-test that they are the same is b[group=0]-b[group=1]=0
* var(b[group=0]-b[group=1])= var(b[group=0]) + var(b[group=1])

local joint_var_cons = `se_cong0'^2 + _se[_cons]^2

* And the t test is (b[group=0]-b[group=1])/(var(b[group=0]) + var(b[group=1]))^.5

di "t_constants = " (`b_cong0'- _b[_cons] )/(`joint_var_cons')^.5

local joint_var_x= `se_xg0'^2 + _se[x]^2

di "t_constants = " (`b_xg0'-_b[x])/`joint_var_x'^.5

* This formulation assumes that the estimates vary independently.

* If done properly the t-stats are almost identical between formulations.

* By the way, please fill out my survey posted on the last post.

Saturday, December 8, 2012

Small Anonymous Research Survey

Dear Reader,

I am experimenting with Google Documents in order to conduct a small survey that I will use to attempt to answer some research questions that I have.

Please consider taking this anonymous short survey and sharing it with your friends.

https://docs.google.com/spreadsheet/viewform?formkey=dGZoWFRvbmMwZXVyRFVmc3daMGpoVXc6MQ

Thank you for your time,
Francis Smart

Friday, December 7, 2012

The Delta method to estimate standard errors from a non-linear transformation

* Stata do file

* The Delta method can be used to estimate the standard errors after a regression estimation.

* Imagine you have some parameter G = (3*b0-b1)*b2^2 = 3*b0*b2^2-b1*b2^2

* Where y = b0 + b1*x1 + b2*x2 + u

* The delta method can be used to estimate the standard errors of G.

* The delta method states that var_hat(G) = (dG/db) var(b) (dG/db)'

* dG/db is a gradient vector:

* dG/db = [dG/db0, dG/db1, dG/db2]
* dG/db = [3*b2^2, -b2^2, 2*(3*b0-b1)*b2]

* var_hat(G) = (3*b2^2)^2 * se(b0)^2 + (-b2^2)^2 * se(b1)^2 + (2*(3*b0-b1)*b2)^2 * se(b2)^2

[There is an error in the code because I failed to include a covariance term for the coefficients.  Please see the more recent update on the method.]

clear
set obs 1000

gen x1 = rnormal()
gen x2 = rnormal() * 4

global b0 = 1
global b1 = 1.5
global b2 = .3

local true_G = (3*${b0}-${b1})*${b2}^2

di `true_G'

gen y = ${b0} + ${b1}*x1 + ${b2}*x2 + rnormal()*8

reg y x1 x2

* G = (3*b0-b1)*b2^2 = 3*b0*b2^2  - b1*b2^2
local Ghat = (3*_b[_cons]-_b[x1])*_b[x2]^2

di "Ghat = `Ghat' is our estimate (true = `true_G')"

* Let's see if we can't use the delta method to derive a standard error.
local var_hatG = (3*_b[x2]^2)^2 * _se[_cons]^2 + (-_b[x2]^2)^2 * _se[x1]^2 + (2*(3*_b[_cons]-_b[x1])*_b[x2])^2 * _se[x2]^2

di "Standard error estimate is " `var_hatG'^.5

* Alternatively, let us attempt to bootstrap our standard errors.

cap program drop deltaOLS
program define deltaOLS, rclass
  reg y x1 x2
  return scalar Ghat = (3*_b[_cons]-_b[x1])*_b[x2]^2
end

bs Ghat=r(Ghat), rep(500): deltaOLS
* The bootstrap standard errors are similar to that of the delta method's standard errors.

cap program drop deltaMonteCarlo
program define deltaMonteCarlo, rclass
  clear
  set obs 1000

  gen x1 = rnormal()
  gen x2 = rnormal() * 4

  gen y = ${b0} + ${b1}*x1 + ${b2}*x2 + rnormal()*8

  reg y x1 x2

  return scalar Ghat = (3*_b[_cons]-_b[x1])*_b[x2]^2
end

simulate Ghat=r(Ghat), reps(500): deltaMonteCarlo
sum
* We can see that our estimates of the standard error via the delta method and the bootstrap are quite close to the Monte Carlo estimate of the standard deviation of Ghat observed across 500 replications.

Thursday, December 6, 2012

Logistic or Logit or doesn't matter


Stata do file

* I recently attended a talk in which the presenter contrasted logit with logistic regression for a binary outcome variable, saying logistic regression was similar.

* This confused me.

* Looking at the equations for logit versus those of logistic regression, I think it is clear that a logit regression is a special case of a (multinomial) logistic regression.

* However, in case there is still doubt.

clear
set obs 100

gen x = rnormal()
* Some explanatory variable

gen y = rbinomial(1, normal(x))
* Some binary response variable

mlogit y x, baseoutcome(0)

logit y x
* We can see the estimates are identical.

Wednesday, December 5, 2012

Path Analysis

Stata do file

* Path analysis is an interesting statistical method that can be used to identify complex relationships between explanatory variables and an outcome variable.

* As with all statistical methods, the modelling framework is essential for deriving reasonable results.

* Conveniently, I am only interested in simulating data, so as usual my data will perfectly conform to the model's specification.

* Imagine the following model.

* [Path diagram omitted: boxes for the observed variables A through H connected by one-directional arrows, one for each path coefficient listed below.]

* All of the boxes are observable variables.  The arrows indicate the causal direction of the effects.

* There are two exogenous variables: A and D.   These variables are not influenced by any other variables in the model.

* All other variables are endogenous.

* Each of the path coefficients represents the direct effect of a one unit change in one variable on the variable it points to.

* This framework is convenient because it allows us to identify a "total effect," which combines the direct and indirect effects of a variable on the outcome variable.

* The variable we are primarily interested in explaining is H.

* The variable G has only a direct effect on H (pHG).

* While the variable C only has an indirect effect on H (pFC*pHF).

* The indirect effect is a product because C has an effect of pFC on F and F has an effect of pHF on H; thus the change in H resulting from a change in C is how much F changes as a result of C times how much that change in F affects H.

* Variables can have both an indirect and a direct effect.

* B for instance has the direct effect: pHB
* Indirect effects: pCB*pFC*pHF + pEB*pHE
* Total effect: pHB + pCB*pFC*pHF + pEB*pHE

* The key feature about this particular example is that all of the arrows are one directional.

* This makes a great deal of inference possible that otherwise would not be.

* Usually, when trying to explain H with explanatory variables A through G, we cannot say that A causes B and B causes H.

* However, if we do the work to identify reasonable pathways then this type of analysis can be quite interesting.

* Let's generate our data.

clear

* Let's imagine 6000 youth in our sample.

set obs 6000

* Let's first specify our effects

* pEA = .3
* pEB = .13

* pHA = .2
* pHB = .2
* pHE = .3
* pHG = 1.1
* pHF = .2

* pBA = .5

* pCB = .2
* pCD = .1

* pGD = .2

* pFC = .76
* pFB = .4

* For B we can calculate our true effects:

* B Direct: pHB = .2

* Indirect effects: pCB*pFC*pHF + pEB*pHE
* Indirect effects:.2*.76*.2 + .13*.3 = .0694

* Total effect: .0694+.2 = .2694

gen A =                rnormal()
gen B = A*.5 +         rnormal()
gen D =                rnormal()
gen C = B*.2 + D*.1 +  rnormal()
gen E = A*.3 + B*.13 + rnormal()
gen F = B*.4 + C*.76 + rnormal()
gen G = D*.2 +         rnormal()
gen H = E*.3 + A*.2 + B*.2 + F*.2 + G*1.1 + rnormal()

* Simulation done

* In order to estimate our different effects we simply run OLS for each endogenous variable.

reg B A
  local pBA = _b[A]

reg C B D
  local pCB = _b[B]
  local pCD = _b[D]

reg G D
  local pGD = _b[D]

reg F C B
  local pFB = _b[B]
  local pFC = _b[C]

reg E A B
  local pEA = _b[A]
  local pEB = _b[B]

reg H A B E F G
  local pHA = _b[A]
  local pHB = _b[B]
  local pHE = _b[E]
  local pHF = _b[F]
  local pHG = _b[G]

* To estimate, say, the indirect effect of B on H,
* we just plug our estimates into the equation.

* B direct effect: pHB
* Indirect effects: pCB*pFC*pHF + pEB*pHE
* Total effect: pHB + pCB*pFC*pHF + pEB*pHE

di "B's estimated indirect effect = `pCB'*`pFC'*`pHF' + `pEB'*`pHE'"
di "B's estimated indirect effect = " `pCB'*`pFC'*`pHF' + `pEB'*`pHE'

* Which turns out to be close to our true value.

di "B's total estimated effect on H is " `pHB' + `pCB'*`pFC'*`pHF' + `pEB'*`pHE'

* It is possible to use the user written command pathreg to make things easier.

* Install it by typing the following command: findit pathreg
pathreg (H E B F G) (G D) (C B D) (B A) (E A B) (F B C)

* This command does not currently calculate all of the indirect and direct effects.

* I am not sure the best way to calculate the standard errors of the different effect estimates.

* My guess is that since this is just a series of fast OLS regressions, the easiest thing to do would be to bootstrap the entire process.

* This would require slightly more code but would definitely be easy to do from this point.

Tuesday, December 4, 2012

Multivariate Least Squares - Multi-Step Estimator and Correlated Explanatory Variables

Stata do file

* In least squares estimation it is very common to make a statement like "conditioning on x1, x2 has an average effect on y of ..."

* What does this mean?

* One answer is: after removing from x2 the variation that x1 predicts, the least squares estimator uses the remaining variation in x2 to estimate the effect of x2 on y.

* I am not really sure if this helps.

* Perhaps an example.

* Imagine a continuous explanatory variable x2

clear
set obs 1000

gen x2 = rnormal()

* Now imagine a binary variable x1 correlated with x2.

gen x1 = rbinomial(1, normal(x2))

cor x?

* We can see that the xs have a correlation around .5

* This means, in some sense, that about 25% (.5^2) of the variation in x2 is explained by x1.

* This has very practical implications for regression analysis using x1 and x2.

* Imagine x1 is whether the individual parents have college degrees and x2 is years of personal college education.

* Because x1 and x2 are correlated, any estimation using them together can only isolate the individual effect of x1 from that of x2 through the variation in which they move separately from each other.

gen y1 = x1 + x2 + rnormal()*2

reg y1 x1 x2, nocon

* We can see that both coefficients are estimated well.  So, what do I mean by my previous statement?

* To see how the estimate for x2 is based only on the variation remaining after controlling for x1, we can use a two-step estimator that reproduces the previous regression.

* First we regress x2 on x1

reg x2 x1, nocon

* Now we take the residuals from the above regression.

predict x2_residual, resid

reg y1 x2_residual, nocon

* The result of the above regression is that the coefficient on the residuals of x2 is exactly the same as the coefficient on x2 in the original multivariate regression.

* Thus we can see that the result of a multivariate analysis is based not on the total variation in one x but on the variation in that x which is independent of the variation in the other xs.

* All statistical estimation is an inductive, exploratory task based on exploiting variation in observable characteristics.

* However, some variables move together with other variables making it difficult to estimate the effect of individual movements in some variables.

* It is worth noting that there is a different two-step estimation procedure, often used in some applications, which is quite different from the multivariate OLS regression.

reg y1 x1, nocon

predict y1_resid, resid

reg y1_resid x2, nocon

* This estimator only works as well as the OLS multivariate estimator when x1 is independent of x2.  If it is not then this estimator is heavily biased.

* Why?  In the first step it attributes all of the covariation between y and x1 to x1 (even though some of it truly belongs to x2).

* In the second stage it is using what is left over of the variation in y as the explanatory variable.

* In this case, this method vastly underestimates the magnitude of the coefficient on x2 and overestimates the magnitude of the coefficient on x1.