## Friday, June 22, 2012

### Linear Probability Model (LPM) under misclassified dependent variables

* Linear Probability Model (LPM) under misclassified dependent variables

clear
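set obs 10000
* Fix the seed so the random draws below are reproducible (the value itself is arbitrary).
set seed 12345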

gen x = rnormal()
label var x "Explanatory variable"

gen u = rnormal()
label var u "error: Unexplained variation"

* I want y_prob to be centered at .5 so that the Bernoulli draw conditional on x and u is least likely to run into the bounds on probabilities, which would make the coefficients difficult to interpret.
* If y_prob were centered at, say, .2, a potentially large number of probabilities could fall below 0, and the estimates would be naturally attenuated because a true probability is bounded between 0 and 1.
gen y_prob = .5 + .075*x + .075*u
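
* A quick check, added here as a sketch: count how many simulated probabilities fall outside [0,1].  With these coefficients the count should be zero or very close to it.
summarize y_prob
count if !inrange(y_prob, 0, 1)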

gen y_ideal = rbinomial(1,y_prob)

* These are the ideal y values, where there is no misclassification.
reg y_ideal x

* We can see that we recover the coefficient on x (true value .075) quite well despite the binary nature of y_ideal.

* Now let's imagine that we do not observe y_ideal but rather a noisy measure of it.
* 10% of our observations have misclassified y values.
gen misclassified = rbinomial(1,.1)

* The following makes the observed y equal to the ideal when misclassification is not present.
gen y_observed = y_ideal if misclassified==0
* When misclassification is present, mod(y_ideal+1,2) flips the binary value: 0 becomes 1 and 1 becomes 0.
replace y_observed = mod(y_ideal+1,2) if misclassified==1
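
* As a sanity check (a sketch, not part of the original estimates), cross-tabulate the ideal and observed outcomes.  Roughly 10% of each row should sit off the diagonal.
tabulate y_ideal y_observed, row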

reg y_observed x
* We can see that attenuation bias from the misclassification of Y is present: the estimate on x is biased towards zero.
* An interesting thing about this particular scenario is that we can easily predict the direction and size of the bias.
* As the misclassification rate approaches .5, we expect the coefficients on all variables that are uncorrelated with the misclassification to approach zero.
* In the unlikely case that the misclassification rate exceeds .5, we would expect the estimated coefficients to take the opposite sign.
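
* A sketch of why this happens, assuming the flips are independent of x and of y_ideal (as they are in this simulation).  With flip rate a:
*   E[y_observed|x] = (1-a)*E[y_ideal|x] + a*(1-E[y_ideal|x]) = a + (1-2a)*E[y_ideal|x]
* so the slope on x is scaled by (1-2a).  With a true slope of .075, the expected LPM slopes at three flip rates are:
display "flip rate .10: " .075*(1-2*.10)
display "flip rate .50: " .075*(1-2*.50)
display "flip rate .85: " .075*(1-2*.85)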

* Misclassification rate of .5: the observed y carries no information about x.
gen misclassified2 = rbinomial(1,.5)
gen y_observed2 = y_ideal if misclassified2==0
replace y_observed2 = mod(y_ideal+1,2) if misclassified2==1
reg y_observed2 x

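* Misclassification rate of .85: most values are flipped, so the sign of the estimate should reverse.
gen misclassified3 = rbinomial(1,.85)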
gen y_observed3 = y_ideal if misclassified3==0
replace y_observed3 = mod(y_ideal+1,2) if misclassified3==1
reg y_observed3 x
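
* If the flip rate were known (it rarely is in practice, so treat this as an illustrative sketch), dividing the LPM slope by (1-2a) should roughly undo the rescaling, at the cost of amplifying the sampling noise.  Here a = .85:
quietly reg y_observed3 x
display "rescaled slope: " _b[x]/(1-2*.85)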

* The question then is, do we expect the probit/logit models to behave better than the LPM?
* See: http://davegiles.blogspot.com/2012/06/yet-another-reason-for-avoiding-linear.html

* Dave references the following paper:
* Hausman, J. A., J. Abrevaya & F. M. Scott-Morton, 1998. Misclassification of the dependent variable in a discrete-response setting. Journal of Econometrics, 87, 239-269.
probit y_observed x
margins, dydx(*)
* Looking at the average partial effect (APE), it seems to me that the LPM and the probit yield nearly identical results.
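
* One way to put the two numbers side by side (a sketch; the matrix name ape is just a label chosen here):
quietly reg y_observed x
display "LPM slope:  " _b[x]
quietly probit y_observed x
quietly margins, dydx(x)
* margins leaves its point estimates in r(b)
matrix ape = r(b)
display "probit APE: " ape[1,1]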

* dprobit offers another approach, though I am less comfortable with it than with the average partial effect: dprobit reports the effect evaluated at the means of the regressors, whereas the APE seems to me to better represent the characteristics of the underlying population.
dprobit y_observed x
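
* In newer versions of Stata, margins can produce the same kind of effect at the means.  A sketch of that alternative:
quietly probit y_observed x
margins, dydx(*) atmeans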

* I find myself more and more in favor of using the LPM if all that you are looking for is the average partial effect.
* MLE methods are more flexible in general and allow the effect of x to vary with population characteristics, which can be good if the specification is known, but I cannot help but think they are hazardous if the specification is unknown.
* Probit is also a more complex procedure, and it tends to yield results very similar to those of the LPM.
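
* To illustrate how the probit lets the effect of x vary, the partial effect can be evaluated at a few values of x (the values -2, 0 and 2 are arbitrary choices for this sketch).  The LPM, by construction, implies the same effect at every x.
quietly probit y_observed x
margins, dydx(x) at(x=(-2 0 2))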

* Thus, in the absence of information about the specification, I am inclined to use the LPM over the probit.  I know that this can violate the underlying theory, but if the estimates are just as good (or just as bad) as the estimates from other procedures, then I prefer to use the method that is easiest to do and to write up.