## Thursday, January 24, 2013

### Choosing an appropriate level of analysis

do file

* Often times we are confronted with having data that has many potential levels of interest.

* It is not always clear what the best level of analysis is in that case.

* Imagine that you have district average achievement rates on a statewide exam.

* You also have individual characteristics of individual students in your population such as age, gender, socioeconomic status.

* Now your question is, what is the best way to predict district pass rates? (You do not have individual passing values).

* The difficulty is that you dependent variable (pass rates) is on an aggrogate level while your independent variables are on a individual level.

* What should you indist_de as your explanatory variables?
* a. Individual level?
* b. Mean by district or median or some other aggrogate statistic?

* The following simulation will attempt to shed some light on this question though I do not feel I can really resolve it in any way without more thought.

* I will structure the simulation directly into a "program" structure in order to speed up the monte carlo process which I will use to evaluate the estimators.

cap program drop ols_levels
program define ols_levels, rclass

clear

set obs 50
* We have 50 districts

gen dist_id = _n

gen dist_1 = rnormal()
gen dist_2 = rnormal()
* Districts have some continuous observable heterogeniety

expand `3'
* We have parameter 3 # of students in each district

* The first two arguments of the ols_levels command will specify the correlation between the individual chacteristics and the aggrogate characteristics.
gen x1 = rnormal()+dist_1*`1'
gen x2 = rnormal()+dist_2*`2'
* The students each have some individual characteristics which are correlated with the district level heterogeniety.

gen u = rnormal()*5
* Each student has some unobserved achievement shock.

****
* Now the next step is the one I had the most difficulty with.
* We need to convert individual achievement into average achievement.
* That is done by first generating individual achievement:

gen y_ind = dist_1+dist_2+.5*x1-.75*x2+u
* This is the value we would like to observe.
label var y_ind "Individual achievement"

* However, what we instead observe is
bysort dist_id: egen y = mean(y_ind)
label var y "District average achievement"

* Now we ask the question, how do we estimate the effect of our explanatory variables on observed average achievement?

* One suggestion would be to estimate observable explanatory variables on the observable dependent variable:
reg y dist_1 dist_2 x1 x2, cluster(dist_id)
* I cluster at the district id since normally the errors would be correlated on the district level.
return scalar Ax1=_b[x1]
return scalar Ax2=_b[x2]

* Alternatively we could argue that it does not make sense to use disaggrogated explanatory variables when the dependent variable is aggrogated since the error that we might hope to explain through having the additional variance in the explanatory variables cannot be correlated with the disaggrogated explanatory variables.

bysort dist_id: egen x1_mean = mean(x1)
bysort dist_id: egen x2_mean = mean(x2)

reg y dist_1 dist_2 x1_mean x2_mean, cluster(dist_id)

return scalar Bdist_1=_b[dist_1]
return scalar Bdist_2=_b[dist_2]
return scalar Bx1_mean=_b[x1_mean]
return scalar Bx2_mean=_b[x2_mean]

* As a matter of reference let's see what happens when we only include the district level characteristics.
reg y dist_1 dist_2, cluster(dist_id)

return scalar Cdist_1=_b[dist_1]
return scalar Cdist_2=_b[dist_2]
return scalar Cx1=0
return scalar Cx2=0
end

* Run the simulation one time with no correlation between dist_ variables and individual variables.
ols_levels 0 0 500

Bdist_1=r(Bdist_1) Bdist_2=r(Bdist_2) Bx1_mean=r(Bx1_mean) Bx2_mean=r(Bx2_mean) ///
Cdist_1=r(Cdist_1) Cdist_2=r(Cdist_2) Cx1=r(Cx1) Cx2=r(Cx2) ///
,reps(500): ols_levels 0 0 500
sum

/*  Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
Adist_1 |       500    .9988695    .0385498   .8776859   1.122928
Adist_2 |       500    .9993321    .0360491   .8834813   1.105001
Ax1 |       500    .0012883    .0017756  -.0043576   .0063994
Ax2 |       500   -.0016733    .0018259  -.0076018   .0041119
Bdist_1 |       500    .9985228    .0389298   .8760694   1.122633
-------------+--------------------------------------------------------
Bdist_2 |       500    .9993324    .0361465   .8882689   1.105491
Bx1_mean |       500    .5431928    .7500353  -2.483676   2.507016
Bx2_mean |       500   -.6719179    .7527237  -3.020345   2.034019
Cdist_1 |       500    .9988699    .0385515   .8776129   1.122928
Cdist_2 |       500    .9993329    .0360523   .8834765   1.104997
-------------+--------------------------------------------------------
Cx1 |       500           0           0          0          0
Cx2 |       500           0           0          0          0 */

* We can see that though we know that x1 and x2 causes individual achievement, they do not provide a good basis for estimating average achievement.

* However, the mean x1 and mean x2 provide reasonable esimates of the effect of x1 and x2 on achievement.

* Can reason this through by thinking about how the average can be represented.

* y_mean = 1/N * sum(y_ind) = sum(dis_1)/N*b1 + sum(dis_1)/N*b2 + sum(x1)/N*b3 + sum(x2)/N*b4 + sum(u)/N

* When seen this way it seems clear that the average of the explanatory variables should be a better predictor than the variable on the individual level.

* But why does the individual value fail so badly?

* I believe it is because the individual value only succeeds at all because it is correlated with the average explanatory variable.

* We can test this hypothesis:

Bdist_1=r(Bdist_1) Bdist_2=r(Bdist_2) Bx1_mean=r(Bx1_mean) Bx2_mean=r(Bx2_mean) ///
Cdist_1=r(Cdist_1) Cdist_2=r(Cdist_2) Cx1=r(Cx1) Cx2=r(Cx2) ///
,reps(500): ols_levels 0 0 2

sum

/*  Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
Adist_1 |       500    .9988851    .5417397  -.8754451   2.458699
Adist_2 |       500    .9978838    .5054726  -.4637527   2.396384
Ax1 |       500    .2299528    .3559766  -.9725819   1.328953
Ax2 |       500   -.3321666    .3767386  -1.615652   .7381132
Bdist_1 |       500    .9953491    .5473069  -.8368359   2.489341
-------------+--------------------------------------------------------
Bdist_2 |       500    .9980698    .5141795  -.3417804   2.519113
Bx1_mean |       500    .4715293    .7382465  -2.060023   2.356781
Bx2_mean |       500   -.6794988    .7577947  -2.587295   1.302411
Cdist_1 |       500    1.003058     .544771  -.9002677   2.433167
Cdist_2 |       500    .9979357     .510302  -.6022506   2.372237
-------------+--------------------------------------------------------
Cx1 |       500           0           0          0          0
Cx2 |       500           0           0          0          0 */

* We can see that now, while the individual level explanatory variables are not as effective as the district level ones, they provide estimates which are closer to the true parameter than the previous simulation.

* Finally, let's look at what happens when there is correlation between district level values and individual level explanatory variables.

Bdist_1=r(Bdist_1) Bdist_2=r(Bdist_2) Bx1_mean=r(Bx1_mean) Bx2_mean=r(Bx2_mean) ///
Cdist_1=r(Cdist_1) Cdist_2=r(Cdist_2) Cx1=r(Cx1) Cx2=r(Cx2) ///
,reps(500): ols_levels .45 -.45 200

sum

/*  Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
Adist_1 |       500    1.226031    .0521453   1.067398   1.384503
Adist_2 |       500    1.334787    .0542503   1.155632    1.48664
Ax1 |       500    .0021011    .0035268  -.0094846   .0116934
Ax2 |       500   -.0036379    .0034786  -.0146368   .0058956
Bdist_1 |       500    1.030821    .3416801   .1528734   1.992299
-------------+--------------------------------------------------------
Bdist_2 |       500    .9902028    .3403735  -.0965584   2.060089
Bx1_mean |       500    .4359488    .7501554  -1.634525   2.529116
Bx2_mean |       500   -.7689256    .7431586  -3.064452   1.332662
Cdist_1 |       500    1.226977    .0520978    1.06962   1.383977
Cdist_2 |       500    1.336426    .0542272   1.157392   1.485595
-------------+--------------------------------------------------------
Cx1 |       500           0           0          0          0
Cx2 |       500           0           0          0          0 */

* Now we can see the situation has changed.

* Not only are the individual level explanatory variables not providing good estimates but they are actually causing the district level estimates to be biased because they are failing to control for the correlation between district mean explanatory variables and district mean dependent values.