Wednesday, September 11, 2013

Classical Measurement Error and Attenuation Bias

* Classical measurement error occurs when a variable of interest, either an explanatory or a dependent variable, is measured with an error that is independent of its true value.

* We can think of this as the noisy scale phenomenon.

* Imagine that you have a remarkably unreliable scale.  

* Every time you stand on it, the reading varies by an average of 10 pounds.

* You know that your weight does not vary by 10 pounds and that every time you get on the scale there is a different value returned.

* If you are clever, you may map out the differences in weights and establish a distribution.

* Then you may decide how accurate you need your weight measurement to be (for instance, within 1 pound of the true weight) at a 95% confidence level.

* This line of reasoning will lead you to a specific number of times you must weigh yourself on the scale, taking the average of those weights, in order to be confident of your weight.

* The formula is fairly straightforward.  We know the standard deviation of the measurement is 10 pounds.

* We know the standard error of a mean estimate is sd/root(n)

* A 95% confidence interval is roughly the mean plus or minus 2 standard errors, so we need 2*SE = 1, that is SE = 1/2 = 10/root(n)

* solve for n: root(n) = 20 or n = 400

* So we need to measure ourselves around 400 times to ensure that the average is within 1 pound of the true weight 95% of the time.
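
* As a quick check of the arithmetic above, the two lines below compute n directly.
* The first uses the rule of thumb of 2 standard errors; the second uses the exact
* normal quantile (about 1.96), which gives a slightly smaller n of about 384.
display "n using 2 standard errors:         " (2*10/1)^2
display "n using the exact normal quantile: " (invnormal(0.975)*10/1)^2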

* Let's see this in action.  First I will design a simulation that generates the values.

cap program drop simME
program define simME

  clear
  set obs `1' // The first argument defines how many draws

 * Let's say the true weight is always 210
  gen weight = 210 + rnormal()*10

  sum weight

end

simME 400
* Chances are the mean weight is within 1 of 210

* Let's repeat this 2000 times to see how frequently we miss by more than 1
simulate, rep(2000): simME 400

gen miss = 0
replace miss = 1 if mean<209 | mean>211
sum // I get a miss rate of a little less than 5% which is basically perfect!

* Changing our sample size to 200 will affect our confidence in the outcome
simulate, rep(2000): simME 200
gen miss = 0
replace miss = 1 if mean<209 | mean>211
sum // Now the miss rate is about 15%
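
* The simulated miss rates can be compared with what the normal distribution predicts.
* This is only a sketch: it assumes the sample mean is exactly normal with
* standard error 10/root(n), so a miss is a draw more than 1/SE standard errors from 210.
display "Expected miss rate, n=400: " 2*normal(-1/(10/sqrt(400)))
display "Expected miss rate, n=200: " 2*normal(-1/(10/sqrt(200)))
* These come out to roughly 4.6% and 15.7%, in line with the simulated rates.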

* Before moving on to think about how measurement error can affect our estimates, reflect for a moment on the similarities between handling sampling error in statistical hypothesis testing and controlling for measurement error by repeatedly taking measures.

* In statistical modeling, we have an underlying model for which we would like to understand how well our parameters fit.

* We would like to look at just one or two observations in order to draw conclusions, but we assume there is sampling error, and in order to minimize that sampling error we draw more observations.

* We assume an underlying distribution of sampling error and use estimates of the parameters of that distribution to estimate how much confidence we can have in our estimated outcome.

* This variance in estimated outcome is really very similar to the idea of measurement error.

* After some thought, I think the two procedures are identical, just with different names!


* So enough about that! Let's see how measurement error affects our estimates.

* First let's assume we are trying to model weight gain among cattle and we are using our noisy scale to measure the outcome variable.

* Let's say that there is some linear relationship between calories consumed and weight.

* Weight = calories*B + u

* Let's suppose that there is some measurement error so that our observed weight (Weight') is equal to true weight (Weight) plus the measurement error (v).

* Weight' = Weight + v

* So we need to substitute our observed Weight' into our model:

* Weight' = Weight + v = calories*B + u + v

* Weight' = calories*B + e, where e = u + v

* If the measurement error (v) is uncorrelated with Weight and calories, then measurement error in the dependent variable just reduces the precision of the estimator, since the combined error e=u+v has a larger variance than u alone (assuming that v is not negatively correlated with u).

* Let's see this:


cap program drop simME2
program define simME2
 * First argument is number of observations
 * Second argument is measurement error in the dependent variable

  clear
  set obs `1' // The first argument defines how many draws

  gen calories = rnormal()^2*10

  gen u = rnormal()*10

  gen v = rnormal()*`2'

 * Observed weight: an intercept of 200, a slope of 1 on calories, the error u and the measurement error v
  gen weight = 200 + calories + u + v

  reg weight calories

end

* First with no measurement error
simulate, rep(2000): simME2 100 0
sum

simulate, rep(2000): simME2 100 10
* We can see that there is no bias introduced by measurement error. Only less precision in estimates (larger standard deviation).
sum

* Thus an outcome variable that is hard to measure does not change the fundamental model; it only diminishes our ability to detect real effects.
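
* A rough sketch of how much precision is lost: with u and v independent, the
* combined error e=u+v has variance var(u)+var(v), so the standard deviation of the
* slope estimates should grow by the factor sqrt(var(u)+var(v))/sqrt(var(u)).
* Using the values from the simulation above (sd of 10 for both u and v):
display "Expected inflation of the slope's standard deviation: " sqrt(10^2 + 10^2)/sqrt(10^2)
* This is about 1.41, so the spread of the estimates should rise by roughly 41%.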

* Now let's look at a slightly more interesting case, when the explanatory variable has some measurement error.

* Let's suppose once again that we are measuring weight at 6 months, and that it is a predictor of the sale price of cattle at 12 months.

* The model we want to estimate is: price = Weight*3 + u

* Weight is once again measured noisily (Weight').

* Since Weight' = Weight + v, we can substitute Weight = Weight' - v:
* price = Weight*3 + u
* price = (Weight' - v)*3 + u
* price = Weight'*3 - v*3 + u

* Though this might look like it does not present a problem, it does.

* Now the explanatory variable Weight' is correlated with the error term e=u-v*3 since cov(Weight',v) = var(v) > 0, which gives cov(Weight',e) = -3*var(v) < 0

* If we write the OLS estimator in terms of covariances we get:

* Bhat=cov(Weight',price)/var(Weight')
* Bhat=[cov(Weight, price) + cov(v,price)]/var(Weight')
* Substituting in the true price model (price = Weight*3 + u):
* Bhat=[cov(Weight, Weight*3 + u) + cov(v,Weight*3 + u)]/var(Weight')
* Bhat=cov(Weight, Weight*3 + u)/var(Weight') + cov(v,Weight*3 + u)/var(Weight')

* We know that cov(v,Weight*3 + u) = 0, so the second term is zero. So what is the problem?
* The problem lies in the first term: Bhat=cov(Weight, Weight*3 + u)/var(Weight')=3*var(Weight)/(var(Weight)+var(v))

* Given var(v)>0 : |Bhat| < |B|
* Thus attenuation bias!
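
* A quick numerical sketch of the attenuation factor var(Weight)/(var(Weight)+var(v)).
* The variances below are illustrative assumptions, not estimates from data:
* var(Weight) is held at 200 and var(v) takes a few hypothetical values.
foreach vv in 0 50 100 200 400 {
  display "var(v) = `vv'  =>  Bhat approaches " 3*200/(200+`vv')
}
* With no measurement error the slope stays at 3, and it shrinks towards zero as var(v) grows.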

* Let's see it in action!

cap program drop simME3
program define simME3
 * First argument is number of observations
 * Second argument is measurement error in the explanatory variable

  clear
  set obs `1' // The first argument defines how many draws

  gen weight = rnormal()^2*10
 
  gen v = rnormal()*`2'

 * Generate the observed weight
  gen oweight = weight + v

  gen u = rnormal()*10
 
 * The real predictor of price is weight, not observed weight.
  gen price = 3*weight + u

  reg price oweight

end

* First with no measurement error and no problems
simulate, rep(2000): simME3 100 0
sum

* Now with measurement error in the explanatory variable
simulate, rep(2000): simME3 100 10
sum

* We can see there is now a strong bias towards zero in our estimates.
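
* As a rough cross-check, a single large sample can be compared with the formula
* 3*var(weight)/(var(weight)+var(v)). This is only a sketch: the seed and the
* sample size of 100000 are illustrative choices, not part of the simulation above.
clear
set seed 12345
set obs 100000
gen weight  = rnormal()^2*10
gen oweight = weight + rnormal()*10   // measurement error with sd of 10
gen price   = 3*weight + rnormal()*10
qui reg price oweight
display "Estimated slope:                " _b[oweight]
qui sum weight
display "Slope predicted by the formula: " 3*r(Var)/(r(Var)+10^2)
* Both numbers should land near 2 rather than the true value of 3, since var(weight) is about 200 here.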

