* It is well known that measurement error causes attenuation bias in regression analysis estimators.
* This fact appeared in the literature as early as Spearman (1904).
* Attenuation bias, also known as regression dilution, is the phenomenon whereby coefficient estimates are biased toward zero.
* We can understand this phenomenon intuitively by thinking about what measurement error means:
* it means we do not have a very good measure of the quantity we are interested in.
* Imagine we measure people's weight and height just by watching that person walk by.
* We will assume we know the average weight of the population and that, on average, our guesses equal that weight.
* However, unless we are trained, our guesses will probably miss the mark frequently.
* Thus, if we want to use our guesses of weight and height as predictors of a person's athletic ability, then our estimates will suffer from up to two problems as a result of our measurement method.
* The first is the one previously mentioned: attenuation bias, caused by our measures not being exact.
* How are we going to identify the effect of 5 more pounds or 3 extra inches on athletic performance if we are incapable of accurately gauging a difference of 5 pounds or 3 inches?
* The second potential source of problems is that our errors in measurement might be correlated with our unconscious assessment of the subject's athletic ability.
* That is, perhaps we will guess that subjects who appear more athletic are taller or weigh less.
* This second issue is much more problematic than attenuation bias.
* It will induce a correlation between our errors and our explanatory variables, which causes bias of an unknown form.
* To understand why attenuation bias exists, recall that Beta = cov(x,Y)/var(x) and that the OLS coefficient is BetaHat = cov(X,Y)/var(X),
* where the observable X = x + v.
* If we assume the measurement error v is uncorrelated with the outcome variable Y, then cov(X,Y) = cov(x,Y).
* However, if v is also uncorrelated with the true x, then var(X) = var(x) + var(v).
* Thus: BetaHat = cov(X,Y)/var(X) = cov(X,Y)/(var(x)+var(v)) = cov(x,Y)/(var(x)+var(v))
* Therefore: |BetaHat| < |Beta| when var(v)>0
* Let's see this in action!
set seed 101
set obs 100000
* Generate the true (unobserved) weight; the sd of 30 matches the 30^2 used
* in the variance calculation below, while the mean of 150 is an arbitrary choice.
gen true_weight = 150 + 30*rnormal()
gen measurement_error = 20*rnormal()
gen weight_observed = true_weight+measurement_error
gen u = rnormal()* 5
corr true_weight weight_observed
* We can see even with measurement error, our estimate of weight is 82% correlated with the true weight.
gen athletic_performance = 10 - .05*true_weight + u
* We expect our estimate of Beta to be attenuated by a factor alpha, defined by:
* alpha*|Beta| = |BetaHat|
* where |BetaHat| = |cov(x,Y)/(var(x)+var(v))|. Computing this directly:
qui corr true_weight athletic_performance, cov
di r(cov_12)/(30^2 + 20^2)
* = -.03469636
* Thus alpha = .0347/.05 = .69, roughly 70%.
reg athletic_performance weight_observed
* The regression coefficient on weight_observed should be close to the theoretically attenuated value:
di -.05 * (30^2/(30^2 + 20^2))
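* As a cross-check (a sketch using only the variables generated above), we can
* compute alpha from the sample variances rather than the theoretical ones,
* using the r(Var) result returned by summarize:
qui sum true_weight
local var_x = r(Var)
qui sum measurement_error
local var_v = r(Var)
di "empirical alpha = " `var_x'/(`var_x' + `var_v')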
* Thus we can see that the nature of our bias is very predictable under the assumption that the measurement error is uncorrelated with both the true regressor and the outcome variable.
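* When the reliability ratio var(x)/var(X) is known or can be estimated -- here
* 30^2/(30^2 + 20^2) = .69 by construction -- Stata's eivreg command can correct
* for the attenuation. This is a sketch: the reliability value below is taken
* from the known simulation parameters, not estimated from data.
eivreg athletic_performance weight_observed, r(weight_observed .69)
* The corrected coefficient should be close to the true value of -.05.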