## Saturday, August 18, 2012

### Inverse Probability Weighting to Correct for Sample Selection/Missing Data

* Imagine that you have a data set in which the outcome of interest is missing for some observations, but you have a complete set of explanatory variables.  You might be concerned that selection into the observed sample is correlated with the error term, which can make the error correlated with your explanatory variables of interest among the observed cases and so bias your estimates.

* Imagine that you are interested in estimating whether obedience school for dogs has the potential to reduce their risk of biting people.  As the y variable you have the self-reported (by owners) number of bites.  As the explanatory variables you have breed aggressiveness and the number of obedience classes the dog attended.

* Imagine also that you have information on the aggressiveness of the dogs' owners, which is correlated with the error in the equation for the number of bites.  It also helps determine selection, that is, whether the number of bites is observed at all.

clear
set obs 100000

gen n_classes = rpoisson(1)
gen breed_agg = rnormal()
gen owner_agg = rnormal()

* First let's generate the selection indicator
gen p = normal(.5 -.5*n_classes + .5*owner_agg)
gen s = rbinomial(1,p)
gen u = rnormal()+owner_agg

* To start, let's assume there is no selection, so bites is observed for every dog
gen bites = 2 + breed_agg - n_classes + 3*u

reg bites breed_agg n_classes
* We can see that in this case the estimates look good (in the absence of selection)

replace bites = . if s == 0

reg bites breed_agg n_classes
* Now the effect of classes seems greatly diminished in the observed sample, because of the correlation between selection and the error component that comes from owner aggressiveness.
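
* One way to see the problem directly (a quick check added for illustration, possible only because the data are simulated and u is observed):
corr n_classes u
corr n_classes u if s == 1
* In the full sample the two should be roughly uncorrelated.  In the selected sample they should be positively correlated, which pushes the coefficient on n_classes toward zero.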

* There are two ways I can think of to generate unbiased estimates.

* Both work by removing the correlation between selection and the error.

reg bites breed_agg n_classes owner_agg
* Controlling for owner_agg directly is the easiest way to do this.  However, this post is about inverse probability weighting, so that is what we will do.

probit s n_classes owner_agg

predict shat
* This gives the estimated probability of selection based on observables
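
* Because the data are simulated, the fitted probabilities can be compared with the true ones stored in p (a check added for illustration, not something available with real data):
corr p shat
sum p shat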

gen ishat = 1/shat
* Then take its inverse to use as the pweight
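
* Inverse probability weights can become very large when the predicted probability of selection is close to zero, so it may be worth inspecting them among the selected observations (a sketch of a diagnostic, added for illustration):
sum ishat if s == 1, detail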

reg bites breed_agg n_classes [pweight=ishat]
* We can see this estimate is working well, though the previous regression, which controls for owner_agg directly, generally did better.
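
* To compare the three approaches side by side, one option is to store each set of estimates and tabulate them.  The names naive, ipw and control are just labels chosen here for illustration.
quietly reg bites breed_agg n_classes
estimates store naive
quietly reg bites breed_agg n_classes [pweight=ishat]
estimates store ipw
quietly reg bites breed_agg n_classes owner_agg
estimates store control
estimates table naive ipw control, b(%9.3f) se(%9.3f)
* The true coefficients in the simulation are 1 on breed_agg and -1 on n_classes.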

* This post deals with inverse probability weighting in simple OLS.  A future post will address inverse probability weighting in M-estimation: http://ideas.repec.org/p/ifs/cemmap/11-02.html