## Wednesday, October 31, 2012

### Clustering Standard Errors - State Panel Data Example

* Imagine that you are trying to evaluate corporate state labor taxes as a predictor of state employment.

* First let's generate our states
clear
set seed 1033

set obs 50

gen state=_n

* Let's generate some starting values for employment.
gen base_employment=runiform()*.3

* Let's imagine that there is an annual trend in employment for each state.
gen trend=rnormal()*.025

* The policy to raise employment is enacted in different states around year 10.
gen policy_start = rpoisson(10)

expand 20

bysort state: gen t=_n

gen policy=(t>policy_start)

gen employment = .01*policy + base_employment + trend*t + rnormal()*.06
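
* A quick, optional look at the panel just generated. This is a sketch rather
* than part of the original analysis: xtset declares state as the panel
* variable and t as the time variable, and xtsum splits the variation in each
* variable into between-state and within-state components.
xtset state t
xtsum employment policy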

* The naive regression would be to directly estimate the effect of the policy.
reg employment policy

* However, we might be concerned that the sampling is clustered, so that errors are correlated within each state.
* To help control for errors that are correlated within clusters, we can cluster the standard errors.

* We may be interested in the intraclass correlation.
loneway employment state
* This happens to be large.

reg employment policy, cluster(state)
* This substantially increases the size of our standard errors and results in a failure to reject the null.
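
* One way to see the two sets of standard errors side by side (a sketch; the
* stored-estimate names ols and clustered are arbitrary labels chosen here).
* The coefficient is identical in both columns; only the standard error differs.
quietly reg employment policy
estimates store ols
quietly reg employment policy, cluster(state)
estimates store clustered
estimates table ols clustered, se
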
* But in this case we know that the policy has an effect. Should we still cluster our standard errors?

* The answer is yes, we need to cluster our standard errors.

* To show this I will simulate the data 100 times under the alternative scenario in which the null is true and the policy has no effect.

cap program drop cluster_test
program define cluster_test, rclass
clear
set obs 50
gen state=_n
gen base_employment=runiform()*.3
gen trend=rnormal()*.025
gen policy_start = rpoisson(10)
expand 20
bysort state: gen t=_n
gen policy=(t>policy_start)
gen employment = .00*policy + base_employment + trend*t + rnormal()*.06
* NOTE: Now the policy has no effect.
reg employment policy
* Two-sided p-value for the policy coefficient (ttail() alone gives one tail).
local p1 = 2*ttail(e(df_r), abs(_b[policy]/_se[policy]))
return scalar sig1 = (`p1'<.05)
reg employment policy, cluster(state)
local p2 = 2*ttail(e(df_r), abs(_b[policy]/_se[policy]))
return scalar sig2 = (`p2'<.05)
end

simulate sig1=r(sig1) sig2=r(sig2), reps(100): cluster_test
sum

* sig1 is from the regression without clustered standard errors.
* sig2 is from the regression with clustered standard errors.
* We can see that both tests reject the null too frequently (the target is 5%).
* However, in the run reported here, the difference between unclustered and clustered is the difference between falsely rejecting the null 56% of the time and 12% of the time.
* Exact rates will vary with the seed and the number of replications.
* You can repeat the simulation above using 500 or 5000 states (change the set obs 50 line inside the program).
* The more states you use, the closer the Type I error rate gets to 5%.
* However, increasing the number of years does not improve the estimates.
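
* To make that experiment easier, here is a sketch of a parameterized version
* of the program. The name cluster_test_n and the options states() and years()
* are hypothetical additions, not part of the original program. It keeps only
* the clustered regression and uses a two-sided p-value.
cap program drop cluster_test_n
program define cluster_test_n, rclass
    syntax [, states(integer 50) years(integer 20)]
    clear
    set obs `states'
    gen state=_n
    gen base_employment=runiform()*.3
    gen trend=rnormal()*.025
    gen policy_start = rpoisson(10)
    expand `years'
    bysort state: gen t=_n
    gen policy=(t>policy_start)
    * As above, the policy has no effect, so every rejection is a false one.
    gen employment = base_employment + trend*t + rnormal()*.06
    reg employment policy, cluster(state)
    local p = 2*ttail(e(df_r), abs(_b[policy]/_se[policy]))
    return scalar sig = (`p'<.05)
end

* Example: 500 states observed for 20 years each. The mean of sig is the
* share of replications that falsely reject the null.
simulate sig=r(sig), reps(100): cluster_test_n, states(500) years(20)
sum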

* There is one more thing I would like to do with this data so let's generate it once more.
cluster_test

* We may be concerned that the policy was not assigned to each state exogenously but instead arose from an endogenous connection between employment and the policy.

* One method to test the exogeneity of the policy is to check whether the year before the policy was enacted has any predictive power for employment.
gen year_before=policy_start-1