Saturday, August 25, 2012

Breaking down R2 by variable

* It is not generally possible to assign a share of R2 to each individual variable within a regression.

* Let's first look at what R2 is

clear
set obs 1000

gen x1=rnormal()
gen x2=rnormal()*(5)
gen x3=rnormal()*(10)

* I want x1 to explain some variation, x2 to explain more than x1, and x3 to explain even more.

gen u=rnormal()*10

gen y = x1 + x2/2 + x3/3 + u
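
* Given these choices, var(x1) = 1, var(x2/2) = 25/4, var(x3/3) = 100/9, and
* var(u) = 100, so as a quick back-of-the-envelope check of the setup the true
* R2 should be near .155:

di "Theoretical R2 = " (1 + 25/4 + 100/9)/(1 + 25/4 + 100/9 + 100)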


reg y x1 x2 x3

* Now we see that (as expected) x1 has the largest coefficient; however, we know that x2 and x3 contribute more to the variation in y.

* It is not at all clear how much of the R2 each variable explains.

* One interpretation is that R2 is the model sum of squares divided by the total sum of squares.

di "R2 = " e(mss) /(e(rss) + e(mss))

* These sums of squares are easy to calculate by hand.

* SST = sum_i (y_i - mean(y))^2 = N*mean((y_i - mean(y))^2)
sum y
gen ydemean2 = (y-r(mean))^2
sum ydemean2
di "SST = " r(mean)*r(N)

* An identical process can be used to find the model sum of squares (MSS), using yhat rather than y.
predict yhat
sum yhat
gen yhatdemean2 = (yhat-r(mean))^2
sum yhatdemean2
di "MSS = " r(mean)*r(N)
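
* As a sanity check, the ratio MSS/SST computed by hand should reproduce the
* R2 reported above:

sum ydemean2
local SST = r(mean)*r(N)
sum yhatdemean2
local MSS = r(mean)*r(N)
di "R2 (by hand) = " `MSS'/`SST' "  e(r2) = " e(r2)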

* R2 also equals the squared correlation between yhat and y.

corr yhat y

di "R2 = " r(rho)^2

* So can we break apart the R2 into separate variable components?

* Perhaps

reg y x1
reg y x2
reg y x3

* This works fine if the explanatory variables are independent.  However, if they are not, the sum of the separate R2s quickly becomes larger than the collective R2.
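
* As a quick demonstration, with these (independent) regressors the separate
* R2s sum to approximately the joint R2:

local r2sum = 0
forv i = 1/3 {
  qui reg y x`i'
  local r2sum = `r2sum' + e(r2)
}
qui reg y x1 x2 x3
di "Sum of separate R2s = " `r2sum' "  Joint R2 = " e(r2)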

* An alternative approach is to use each variable's contribution to the fitted values from the joint regression.

reg y x1 x2 x3

forv i = 1/3 {
  gen yhat_x`i' = x`i'*_b[x`i'] /* the constant is unimportant for the correlation */
  corr yhat_x`i' y
  di "Maximum R2 (by x`i') is " r(rho)^2
}

* Interestingly, this gives results identical to the separate regressions above: correlation is invariant to scaling, so corr(b_i*x_i, y)^2 = corr(x_i, y)^2, which is just the simple-regression R2.
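
* A quick confirmation that the scaling does not matter:

corr x1 y
di "corr(x1,y)^2 = " r(rho)^2 " (same as the simple-regression R2 for x1)"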

* A final method seems appealing for separating out the explanatory power of each variable.

* I call it the Partial Variation in Projection (PVP)

* For instance, it may be calculated as follows:
sum x1
local pvp1 = r(sd)*_b[x1]
di "PVP(x1) = " `pvp1'


sum x2
local pvp2 = r(sd)*_b[x2]
di "PVP(x2) = " `pvp2'

sum x3
local pvp3 = r(sd)*_b[x3]
di "PVP(x3) = " `pvp3'

* We can see that the PVP values perform well: PVP(x2) is roughly two and a half times PVP(x1), and PVP(x3) is more than three times PVP(x1), in line with the true contributions of 1, 2.5, and 3.33.
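
* As an aside, PVP(xi) is just sd(y) times the standardized "beta" coefficient,
* which Stata will report directly with the beta option:

reg y x1 x2 x3, beta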

* A scaled estimator expresses this in terms of each variable's portion of the total explanatory power (the portion of R2 attributable to each variable):

di "PVP(x1) scaled = " `pvp1' / (`pvp1'+`pvp2'+`pvp3')
di "PVP(x2) scaled = " `pvp2' / (`pvp1'+`pvp2'+`pvp3')
di "PVP(x3) scaled = " `pvp3' / (`pvp1'+`pvp2'+`pvp3')

* Finally, the question becomes: does this method reveal useful values?

* Looking back at how y was generated:

* gen x1=rnormal()
* gen x2=rnormal()*(5)
* gen x3=rnormal()*(10)

* gen u=rnormal()*10

* gen y = x1 + x2/2 + x3/3 + u

* The portion x1 + x2/2 + x3/3 is the explainable variation.

sum x1
local sd_x1 = r(sd)*1
sum x2
local sd_x2 = r(sd)/2
sum x3
local sd_x3 = r(sd)/3


* Within the explainable variation, each variable's share is:

di "x1 represents: " `sd_x1'/(`sd_x1'+`sd_x2'+`sd_x3')
di "x2 represents: " `sd_x2'/(`sd_x1'+`sd_x2'+`sd_x3')
di "x3 represents: " `sd_x3'/(`sd_x1'+`sd_x2'+`sd_x3')

* We can see that the scaled PVP estimates above approximate these true shares.

* However, this type of analysis of explanatory power by variable might only be useful when the explanatory variables are orthogonal.  If the explanatory variables are correlated, then the variation in one is not independent of the variation in the other, and together they will have less explanatory power than two orthogonal explanatory variables would, ceteris paribus.
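
* A minimal sketch of that point (x4 and y2 are new variables introduced only
* for this demonstration): with correlated regressors, the separate R2s add up
* to much more than the joint R2.

gen x4 = x1 + rnormal()*.5
gen y2 = x1 + x4 + rnormal()*2
qui reg y2 x1
local r2a = e(r2)
qui reg y2 x4
local r2b = e(r2)
qui reg y2 x1 x4
di "Sum of separate R2s = " `r2a' + `r2b' "  Joint R2 = " e(r2)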

1 comment:

  1. Have you looked at a Shorrocks-Shapley decomposition? I think it does a similar thing.
