Wednesday, July 4, 2012

The problem of near multicollinearity

* This simulation looks at the problem that arises when the variable of interest is correlated with other explanatory variables.  Omitting the other variables may bias the results, but including them may absorb so much of the variation in the variable of interest that you cannot precisely estimate its coefficient.

* Clear the old data from memory
clear

* Set the number of observations to generate to 1000
set obs 1000

set seed 10

* Generate a positive explanatory variable.
gen x = abs(rnormal())*3

* Imagine we are interested in the coefficient on x.

* Now create correlated explanatory variables
gen z1 = x^2 + rnormal()*10
gen z2 = x^1.75 + rnormal()*10
gen z3 = x^.5 + rnormal()*10 + z2/4
gen y  = 4*x + .5*z1 + .8*z2 + z3 + rnormal()*100

reg y x
* The problem with near multicollinearity is that when you do not include the other correlated explanatory variables, the coefficient on the one that is included can be heavily biased (omitted variable bias).

reg y x z1 z2 z3
* But when you do include them, they can absorb so much of the variation that you have no hope of identifying the true effect of the variable of interest (x).
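
* A sketch (not part of the original simulation) of where the bias in the short regression comes from.  In-sample, the coefficient on x from "reg y x" equals the coefficient on x from the long regression plus, for each omitted variable, its long-regression coefficient times the slope from regressing that variable on x.  The local names implied and b_* are only illustrative.
quietly reg y x z1 z2 z3
local implied = _b[x]
foreach v in z1 z2 z3 {
    * Store the long-regression coefficient on each z
    local b_`v' = _b[`v']
}
foreach v in z1 z2 z3 {
    * Slope of each omitted variable on x, weighted by its long-regression coefficient
    quietly reg `v' x
    local implied = `implied' + `b_`v''*_b[x]
}
quietly reg y x
display "Short-regression coefficient on x: " _b[x]
display "Implied by the long regression:    " `implied'
* The two displayed numbers should agree up to rounding.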

corr x z1 z2 z3
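
* Pairwise correlations do not show how well each regressor is explained by all of the others jointly.  One common summary is the variance inflation factor (VIF), which for each regressor is 1/(1 - R2) from regressing it on the remaining regressors.  A minimal sketch using Stata's built-in postestimation command; the regression is re-run quietly so that it is the most recent estimation.
quietly reg y x z1 z2 z3
estat vif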

* Let's use the user-written Stata command for the Farrar-Glauber multicollinearity tests by Emad Abd Elmessih Shehata.

* The ado file can be found at (

* However, it can also be installed via the command ssc install fgtest

fgtest y x z1 z2 z3
* All of the variables appear to be multicollinear (unsurprisingly).

* Thus we can see the Farrar-Glauber test is working well.


  1. Francis: Strictly speaking, multicollinearity doesn't BIAS the OLS coefficient estimates.

    1. I think he meant omitted variable bias if you leave out some variables to avoid the multicollinearity problem.

    2. Fine - that's a different matter.

  2. Yes, that is what I meant. I was thinking, okay well what is the problem with near multicollinearity? Of course it is a problem of identification due to the correlation between the explanatory variable and other explanatory variables.

    But if that correlation exists, then we have an added problem if we seek what may seem the natural fix and omit some variables in order to lend significance to the variables we are interested in. Perhaps I am the only one who would think to take such an action; if so, then yes, they are very different matters.

  3. How are you calculating the level of multicollinearity? I know you are doing it with this section of the code:

    gen z1 = x^2 + rnormal()*10
    gen z2 = x^1.75 + rnormal()*10
    gen z3 = x^.5 + rnormal()*10 + z2/4
    gen y = 4*x + .5*z1 + .8*z2 + z3 + rnormal()*100

    But where are those exponent numbers coming from? Are these just randomly chosen numbers and functions? Why exponentiate? I am trying to figure out how you can specify the level of collinearity among a set of variables so that I can compare their errors that I will store in a vector.

    1. I am not sure if I have a very satisfying answer. I was just trying to come up with functions that were not linearly dependent yet generally correlated. Exponentiation seemed like the easiest method.
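
      If the goal is to set the level of collinearity directly, one possible approach (a sketch only, not what the simulation above does) is to build the second variable as a weighted sum of the first and independent noise. With a hypothetical target correlation rho, the construction below gives z unit variance and a population correlation of rho with x; the sample correlation will differ slightly from draw to draw.

      clear
      set obs 1000
      set seed 10
      * rho is a hypothetical target correlation between x and z
      local rho = 0.9
      gen x = rnormal()
      * z = rho*x + sqrt(1 - rho^2)*e has variance 1 and covariance rho with x
      gen z = `rho'*x + sqrt(1 - `rho'^2)*rnormal()
      corr x z

      For more than two variables, Stata's drawnorm command with its corr() option draws from a multivariate normal with a given correlation matrix.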