## Wednesday, November 14, 2012

### R-squared

Original Code

* R-Squared

* R-squared and pseudo R-squared are useful statistics reported by most regression-type estimation routines.

* R-squared (R2) is a measure of how much of the variance in y is explained by the model.

* Thus a model with only an intercept has an R2 of 0.

set seed 101
clear
set obs 10000

gen y1=rnormal()

reg y1
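
* A small check, as a sketch: regress leaves its R2 in e(r2), so it can be displayed directly.
* With only an intercept the model sum of squares e(mss) is zero, hence so is R2.
di e(r2)
di e(mss)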

* At the opposite extreme, a model with no unexplained variance has an R2 of 1.

reg y1 y1, noconstant

* Technically this regression should not work, but Stata does the math and produces results anyway.

* Let's see how well R2 approximates explainable variation.

gen x=rnormal()
gen u=rnormal()

gen y2 = (1)*x + u
* The explained part of y2, (1)*x, has variance 1^2 * var(x) = 1
* The unexplained error has variance var(u) = 1
* Thus the true share of explained variance is 1^2*var(x)/(1^2*var(x) + var(u)) = 1/2

reg y2 x
* Thus we can see that our R2 estimate of the explained variance is very close to the true value of .5
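
* As a sketch of where that number comes from: R2 is the model sum of squares
* divided by the total sum of squares, and regress stores both pieces in e().
* This must be run right after "reg y2 x" so that e() still refers to that model.
di e(mss)/(e(mss) + e(rss))
di e(r2)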

* I emphasized the coefficient on x for a reason.

* That coefficient scales the explainable variation by its square.

* Thus:
gen y3 = 2*x + u

* This should have a much larger R2 because the model variance = 2^2*var(x) = 4
* Var(u) = 1
* R2 = 4/(4+1)=.8

reg y3 x
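
* A sketch of the same calculation using sample variances in place of the true ones.
* The scalars vx and vu are names introduced here for illustration only.
* Expect a small gap from the regression R2: in the sample cov(x,u) is not exactly 0
* and the estimated slope is not exactly 2.
quietly summarize x
scalar vx = r(Var)
quietly summarize u
scalar vu = r(Var)
di 2^2*vx/(2^2*vx + vu)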

* With multiple xs the calculation is similar, though any correlation between the xs will factor into the explained variance.

gen x1 = rnormal()
gen x2 = rnormal()

gen y4 = x1 + x2 + u

* R2 = (var(x1) + var(x2)) / (var(x1) + var(x2) + var(u)) = 2/3
reg y4 x1 x2

* If there is correlation between the xs then the calculation above can be substantially off.

* In the extreme case corr(x1,x2)=1 we are back to the same scenario as y3.

* If x1~N(0,1), x2~N(0,1) and corr(x1,x2)=1 then x1=x2, so y = x1 + x2 + u becomes

* y = 2*x1 + u

* Thus R2 = .8

* At the other extreme, corr(x1,x2)=-1

* Then, given that they are both N(0,1), x2=-x1

* y = x1 + x2 + u = x1 - x1 + u = u

* Which is the same as y1

* R2 = 0
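
* A sketch of an intermediate case, assuming corr(x1,w)=.5
* (w, yw and the local rho are names introduced here for illustration only).
* w is built from x1 and x2 so that it is N(0,1) with correlation rho with x1.
* Model variance = var(x1) + var(w) + 2*cov(x1,w) = 1 + 1 + 2*rho = 3
* so the true R2 = 3/(3+1) = .75
local rho = .5
gen w = `rho'*x1 + sqrt(1-`rho'^2)*x2
gen yw = x1 + w + u
corr x1 w
reg yw x1 w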

* R-squared can also be thought of as the square of the correlation between the predicted values and the observed.

reg y4 x1 x2
predict y4hat

corr y4hat y4
* Thus we can see that the correlation between yhat and observed y is about .81.

* A high correlation indicates that our model has done well at predicting the observed outcome.

di r(rho)^2
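
* For comparison, the R2 from the regression itself is still held in e(r2)
* (predict and corr do not clear the estimation results).
di e(r2)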

* A brief note on adjusted R2.

* R2 is known to never decrease as more variables are added to the model, even irrelevant ones.

gen z1 = rnormal()
gen z2 = rnormal()
gen z3 = rnormal()

reg y4 x? z?

* Thus the R2 moved from R-squared = 0.6610
*                     to R-squared = 0.6611

* Knowing this, researchers developed the adjusted R2 (Adj-R2), which slightly penalizes the R2 for including more variables.

* Thus Adj R-squared =  0.6609
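
* A sketch of the adjustment, using the usual formula Adj-R2 = 1 - (1-R2)*(n-1)/(n-k-1),
* where n is the number of observations and k the number of regressors.
* After regress, e(N) holds n and e(df_r) holds the residual degrees of freedom n-k-1.
* Run this directly after "reg y4 x? z?"; it should match the stored e(r2_a).
di 1 - (1-e(r2))*(e(N)-1)/e(df_r)
di e(r2_a)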

* The penalty might be appropriate given what we know about R2; however, the adjustment is trivial here and almost always worth ignoring.

* I generally don't pay attention to the adjusted R2 and I don't know anybody else who does either.

* A .0001 difference in R2 is small enough to ignore without any meaningful loss of information.