## Thursday, July 18, 2013

* Stata code

* When I first started taking stats there was some discussion about the merits of the R2 measure relative to those of the adjusted R2.

* People were concerned that including any additional regressor can, by construction, never decrease the R2, hence the need for a measure that does not automatically rise with the number of regressors.
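
* As a reminder, the adjusted R2 rescales the unexplained share by its degrees of freedom:
*   adjusted R2 = 1 - (1 - R2)*(n - 1)/(n - k - 1)
* with n observations and k regressors (not counting the constant).
* A minimal check of that formula against Stata's stored result. The auto dataset and the
* regressors used here are arbitrary choices for illustration only.
sysuse auto, clear
quietly regress price mpg weight
display "by hand: " 1 - (1 - e(r2))*(e(N) - 1)/e(df_r)  // e(df_r) = n - k - 1
display "stored:  " e(r2_a)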

* In this small command I generate a dependent variable, then generate independent explanatory variables, and see what happens to the R2 and adjusted R2 when we increase the number of explanatory variables.
cap program drop R2r
program define R2r
clear
set obs `2' // The second argument is the number of observations
tempvar y
gen `y' = rnormal()  // Generate the dependent variable
forv i=1(1)`1' {  // Loop from 1 to the number of variables defined as the first argument.
tempvar v`i'
gen `v`i'' = rnormal()
}
reg `y' `v1'-`v`1'' // Do the estimation.
end

set seed 1

R2r 2 10000 // The r-squared is quite small with only two explanatory variables
R2r 20 10000 // The r-squared is much larger

* But we should not rely on the results of just two simulations. Let's try this using the simulate command.
simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 2 10000
sum
* The R2 is about .02% (with pure-noise regressors it should average roughly 2/9999)
* The adjusted R2 is a little less than 0

simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 20 10000
sum
* Now the R2 using 20 regressors is close to .2%
* The adjusted R2 is very close to zero

simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 200 10000
sum
* The same pattern, with the R-squared on average being around 2%

* Using 1000 observations, the R2 is more sensitive to the number of regressors
simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 2 1000
sum
* Now the R2 is a little greater than .2%
* The adjusted R2 is a little greater than .02%

simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 20 1000
sum
* Now the R2 using 20 regressors is 2%

simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 200 1000
sum
* Now the average R2 is greater than 20%

* Overall, the takeaway seems to be: only worry about an inflated R2 when the number of observations is low and the number of regressors is large.
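
* A rough benchmark for the averages above: when the regressors are pure noise, as they are
* here, the expected R2 from OLS with a constant is k/(n-1) for k regressors and n observations.
* The loop below is only a back-of-the-envelope sketch that prints this benchmark for the six
* designs simulated above; it does not rerun any regression.
foreach n in 10000 1000 {
    foreach k in 2 20 200 {
        display "n = `n', k = `k': k/(n-1) = " %6.4f `k'/(`n' - 1)
    }
}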