## Tuesday, December 4, 2012

### Multivariate Least Squares - Multi-Step Estimator and Correlated Explanatory Variables

Stata do file

* In least squares estimation it is very common to make a statement like "conditioning on x1, x2 has an average effect on y of ..."

* What does this mean?

* One answer is: after removing the average correlation between x1 and x2, and the effect x1 has on predicting y, the least squares estimator estimates the effect of x2 on y.

* I am not really sure if this helps.

* Perhaps an example.

* Imagine a continuous explanatory variable x2

clear
set obs 1000

gen x2 = rnormal()

* Now imagine a binary variable x1 that is correlated with x2.

gen x1 = rbinomial(1, normal(x2))

cor x?

* We can see that the xs have a correlation around .5

* This means, in some sense, that about 25% (.5^2) of the variation in x2 is explained by x1.
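
* As a rough check (a sketch added for illustration, not part of the original walk-through), the R-squared from regressing x2 on x1 with a constant should match the squared correlation.  This relies on the stored results e(r2) from regress and r(rho) from correlate.

quietly reg x2 x1

display "R-squared of x2 on x1: " e(r2)

quietly cor x1 x2

display "Squared correlation:   " r(rho)^2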

* This has very practical implications for regression analysis using x1 and x2.

* Imagine x1 is whether the individual's parents have college degrees and x2 is years of personal college education.

* Because x1 and x2 are correlated, any estimation using them together can only isolate the individual effect of x1 from that of x2 through the ways in which they move separately from each other.

gen y1 = x1 + x2 + rnormal()*2

reg y1 x1 x2, nocon

* We can see that both coefficients are estimated well.  So, what do I mean by my previous statement?

* To see that the estimate for x2 is based only on the variation remaining after controlling for x1, we can use a two-step estimator that gives the same coefficient as the previous regression.

* First we regress x2 on x1

reg x2 x1, nocon

* Now we take the residuals from the above regression.

predict x2_residual, resid

reg y1 x2_residual, nocon

* The result of the above regression is that the coefficient on the residuals of x2 is exactly the same as the coefficient on x2 in the earlier regression of y1 on x1 and x2.
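
* One way to check this, as a sketch: store the x2 coefficient from each regression in a scalar (the scalar names b_full and b_twostep are made up for this illustration) and display the difference.

quietly reg y1 x1 x2, nocon

scalar b_full = _b[x2]

quietly reg y1 x2_residual, nocon

scalar b_twostep = _b[x2_residual]

display "Full regression: " b_full "   Two-step: " b_twostep "   Difference: " (b_full - b_twostep)

* The difference should be zero up to rounding (the residuals are stored as floats by default).  The standard errors will generally differ, because the second regression leaves the part of y1 explained by x1 in its residual and does not count the first-step estimate in its degrees of freedom.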

* Thus we can see that the result of multivariate analysis is based not on the total variation in one x, but on the variation in that x which is independent of the variation in the other xs.
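
* The same logic applies in the other direction.  As a sketch, the coefficient on x1 can be recovered by first removing from x1 the part predicted by x2 (x1_residual is a name introduced here for illustration).

reg x1 x2, nocon

predict x1_residual, resid

reg y1 x1_residual, nocon

* The coefficient on x1_residual should match the coefficient on x1 in the regression of y1 on x1 and x2.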

* All statistical estimation is an inductive, exploratory task based on exploiting variation in observable characteristics.

* However, some variables move together with other variables making it difficult to estimate the effect of individual movements in some variables.

* It is worth noting that there is a different two-step estimation procedure, used in some applications, which is not equivalent to multivariate OLS regression.

reg y1 x1, nocon

predict y1_resid, resid

reg y1_resid x2, nocon

* This estimator only works as well as the OLS multivariate estimator when x1 is independent of x2.  If it is not then this estimator is heavily biased.

* Why?  In the first step it attributes all of the covariation of y with x1 to x1 (even though some of it truly belongs to x2).

* In the second stage it uses what is left over of the variation in y as the dependent variable, so x2 can only explain what x1 has not already absorbed.

* In this case, this method underestimates the magnitude of the coefficient on x2 and overestimates the magnitude of the coefficient on x1.
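
* For contrast, here is a sketch of a case where the sequential procedure does roughly work.  The variables x3 and y2 are hypothetical additions for this illustration: x3 is a binary variable drawn independently of x2.

gen x3 = rbinomial(1, .5)

gen y2 = x3 + x2 + rnormal()*2

reg y2 x3 x2, nocon

reg y2 x3, nocon

predict y2_resid, resid

reg y2_resid x2, nocon

* Since x3 and x2 are only approximately uncorrelated in any one sample, the coefficient on x2 in the last regression should be close to, though not exactly equal to, the one in the regression of y2 on x3 and x2.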