* Imagine that you have data on an unemployed population and would like to give job training to some of these workers. However, you are concerned that those who volunteer for the job training program are fundamentally different from those who do not sign up.
* We have data on 2000 individuals.
clear
set obs 2000
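* Optional, and not part of the original walkthrough: fixing the seed makes the random draws below reproducible. The seed value is an arbitrary assumption.
set seed 12345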
* Educational attainment is observed.
gen education = rnormal()
* Ability and motivation levels are unobservable.
gen ability = rnormal()
gen motivation = rnormal()
* Those who have greater motivation are more likely to sign up for the job training program.
gen p_sign_up = normal(motivation)
* Given that probability, the decision to seek training or not is a random draw.
gen sign_up = rbinomial(1,p_sign_up)
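* A quick check, added here as a sketch: compare mean motivation and ability by sign-up status. This is only possible because the simulation lets us see variables a real analyst could not. Motivation should differ across the two groups; ability should not.
tabstat motivation ability, by(sign_up) statistics(mean)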
* Suppose we give training to everybody who signs up.
gen train = sign_up
* Once workers have received training, the training has some benefit on expected earnings (a true effect of 0.5).
gen u = rnormal()
gen exp_earn = ability + 2*education + motivation + train*.5 + u*2
* Let's see how well we can estimate the returns to training.
reg exp_earn train education
* We can see that because motivation (unobserved, so part of the error term) is correlated with assignment to training, selection bias makes training look more effective than its true effect of 0.5.
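* As an illustrative benchmark only (infeasible with real data, since ability and motivation are unobserved), adding them as controls should bring the coefficient on train back near 0.5.
reg exp_earn train education ability motivation
* A rough sketch of the size of the bias, under this simulation's assumptions: signing up with probability normal(motivation) is equivalent to motivation + e > 0 with e ~ N(0,1), so the gap in mean motivation between those who sign up and those who do not is 2/sqrt(pi), about 1.13. Since motivation enters earnings with a coefficient of 1, the regression that omits it should give a coefficient on train near 0.5 + 1.13 = 1.63 in large samples.
display 0.5 + 2/sqrt(_pi)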
* Alternatively, we could give training only to a random selection of those who signed up.
gen train2 = rbinomial(1,.5) if sign_up==1
replace train2 = 0 if train2 == .
gen exp_earn2 = ability + 2*education + motivation + train2*.5 + u*2
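* If we stop there, we have not actually helped anything: across the full sample only people who signed up can be trained, so train2 is still correlated with motivation.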
reg exp_earn2 train2 education
* However, if we restrict the estimation to people who signed up for the training program, we can use the random assignment within that group to obtain an unbiased estimate.
reg exp_earn2 train2 education if sign_up==1
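* A sanity check on the randomization, added here as a sketch: among those who signed up, mean motivation should not differ systematically between the trained and untrained. Again, this is only possible because motivation is visible in the simulation.
ttest motivation if sign_up==1, by(train2)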
* The only catch is that we can no longer claim an unbiased estimate of the effect of the training program for the entire population, only for those who chose to sign up. This is not entirely a bad thing. If the program is voluntary, participation will depend on signing up anyway, so restricting the sample to those who signed up is not an unreasonable restriction.