## Thursday, November 15, 2012

### Random Coefficients with IFGLS and MLE

Original Code

* In a recent post on Estimating Random Coefficients (http://www.econometricsbysimulation.com/2012/11/estimating-random-coefficients-on-x.html), I proposed the use of Maximum Likelihood Estimation (MLE) as a means of estimating both the heteroskedastic covariance matrix and the variance of the random coefficient matrix.

* A blog read Szabolcs Lorincz responded by posting some very nice code showing how to write an Iterative Feasible Generalized Least Squares Estimator that performed very similarly to that of the MLE estimator except that it ran much faster.

* On the particular seed choice the IFGLS estimtor outperformed the MLE estimator in both speed and precision of the estimate.

* On this post I will evaluate two different estimators.

* 1. The original MLE
* 2. The IFGLS posted by Szabolcs Lorincz

* I will evaluate each of these estimators on four criteria:

* 1. Feasiblility of estimates (are the estimates consistent with the model and or are the estimates estimatable - no failure of convergence)
* 2. Unbiasedness of estimates
* 3. Statistical efficiency of estimates
* 4. Computational efficiency of estimates

* I will define a single command that generates the data we will use as our test of model effectiveness.
cap program  drop gen_data
program define gen_data

* In order to make this more interesting let's allow there to be multiple random coefficients and them to be correlated.

clear
set obs 1000

gen x1 = rnormal()
gen x2 = rnormal()*(2^.5)

* Coefficients on x2 should be twice as easy to estimate because the variance in x2 is twice as large.

* Now, for the random coefficients.

matrix C = (5, 1 , 0 \ 1, 3.5, 0 \ 0, 0, 400)

drawnorm v1 v2 u, cov(C)

* I am allowing for there to be two random coefficients, b1 and b2 which can be correlated and an error u which is uncorrelated with the random coefficients.

* The random coefficients are:

gen b1 = 2 + v1
gen b2 = -2 + v2

gen y = b1*x1 + b2*x2 + u

* In our analysis we will be interested in estimating the average effect x1 on y which is 2, the average effect of x2 on y which is -2, as well as the variances of the random coefficients 5 and 3.5 as well as the covariance between the coefficients.

* In order to do that effectively we will need to think about how to identify the separate parts of the error term.

* Think of y = B1x1 + B2x2 + e
* Where e = v1x1 + v2x2 + u

* The variance of each component is identifiable:
* var(e) = x1*var(v1)*x1 + x2*var(v2)*x2 + var(u) + 2*x1*x2*cov(v1,v2) + 2*x1*cov(v1,u) + 2*x2*cov(v2,u)

* For simplicity (but not for neccessity, let's assume cov(v1,u)=cov(v2,u)=0
* Thus: var(e) =  x1*var(v1)*x1 + x2*var(v2)*x2 + var(u) + 2*x1*x2*cov(v1,v2)

* It is useful therefore to define:
gen x1_2 = x1^2
gen x2_2 = x2^2
gen x12 = x1*x2
gen varu = 1
end
* End the data generating program

* Now let's set up the different estimation options:

********************
* 1. The original MLE (from the previous post)

cap program drop myNormalReg
program define myNormalReg
args lnlk xb sigma2
qui replace lnlk' = -ln(sqrt(sigma2') * sqrt(2*_pi)) - (\$ML_y-xb')^2/(2*sigma2')
end

* Let's see it in action:
set seed 1211
gen_data  // First let's generate the data

* Now let's attempt to estimate:

* First as a matter of reference our OLS model:
reg y x1 x2

* We can see already our coefficient estimates are not too bad.

* However, we are interested in estimating more than the coefficients.

ml model lf myNormalReg (y = x1 x2) (sigma2: varu x1_2 x2_2 x12, noconstant)
ml maximize

* We can see that our coefficient estimates on the main effect have not improved substantially.

* In addition, in this run, the estimate on x2_2 is negative which is a violation of our model (variance/covariance has to be PD).

********************
* 2. The IFGLS based on code posted by Szabolcs Lorincz (in a comment on the previous post)

* I will write it as a program for easy repeatability:

cap program drop IFGLS1
program define IFGLS1, rclass

capture drop weight
generate weight=1

* Difference needs to be some real value so as not to create problems on the while step.
local difference=1
local i=0  // Counts the number of iterations

* We need to have itial values for some of our locals
local x1 = 0
local x2 = 0
local _cons = 0
local varv1 = 0
local varv2 = 0
local varu = 0
local cov_v1v2 = 0

* Define some parameters to start the program with.
local tolerance=0.0000001  // One of the conditions of termination is that the difference between parameter iterations in value is less than this number
local maxit=100  // An alternative is that there is this many iterations
local quiet="qui"

* This is the condition under which the program continues to iterate, seeking a better solution.
while (difference'>tolerance' & i'<maxit') {

* Add 1 to the iteration count
local i=i'+1

* Run the regression using the inverse weights (itially they are set to equal to 1)
quiet' reg y x1 x2 [aweight=weight]

* This code calculates the difference between the estimates from the previous regression and the one from the most recent.
local dx1=abs(x1'-_b["x1"])
local dx2=abs(x2'-_b["x2"])
local d_cons=abs(_cons'-_b["_cons"])

* Now save the values from the most recent regression
local x1=_b["x1"]
local x2=_b["x2"]
local _cons=_b["_cons"]

* In order to estimate in a second stage the random coefficients we will use the residuals from the regression as an estimate of e.
capture drop e
predict e, res

capture drop e2
generate e2=e^2

* Remember: var(e) =  x1*var(v1)*x1 + x2*var(v2)*x2 + var(u) + 2*x1*x2*cov(v1,v2)
* And since E(e) = 0, var(e)=E(e^2)

quiet' reg e2 varu x1_2 x2_2 x12, noconstant

* As before, let's calculate the difference in parameter estimates
local dvarv1=abs(varv1'-_b["x1_2"])
local dvarv2=abs(varv2'-_b["x2_2"])
local dvaru=abs(varu'-_b["varu"])
local dcov_v1v2=abs(cov_v1v2'-_b["x12"])

* We will save the variance estimates for later use
local varv1=_b["x1_2"]
local varv2=_b["x2_2"]
local varu=_b["varu"]
local cov_v1v2=_b["x12"]

* The difference is calculating the maximum difference in parameter estimates between iterations.
local difference=max(dx1', dx2', d_cons', dvarv1', dvarv2', dvaru', dcov_v1v2')

* We use the predicted variance to calculate our next stage variance matrix
capture drop fit
predict fit, xb

* In the original code the weights are calculated in the following manner.
* This is somewhat confusing to me because I thought the aweights were based on the inverse of the variance not that standard errors.
* Therefore, I will try with both methods as see which one works better.
* Notice that the way this is done is to ignore this round of estimates if the fit is less than zero.
if "1'"=="" quiet' replace weight=1/(fit^.5) if fit>0 & fit!=.
if "1'"!="" quiet' replace weight=1/fit if fit>0 & fit!=.

* This counts how frequently the fit is less than zero.
quietly count if fit<=0 | fit==.

if (r(N)==0) display "FGLS iteration i', max. abs. difference in parameters: " %7.5f = difference'
if (r(N)>0) display "FGLS iteration i', max. abs. difference in parameters: " %7.5f = difference' ", # of obs. with inconsistent weight: " r(N) " (" 100*r(N)/_N "%)"
} // End while loop

* Now that we have our final looping weights, we can run our coefficient estiamtes oen final time.
reg y x1 x2 [aweight=weight]
return scalar bx1 = _b[x1]
return scalar bx2 = _b[x2]

local cov_v1v2=cov_v1v2'/2

matrix Chat_IFGLS1 = (varv1', cov_v1v2', 0 \ cov_v1v2', varv2', 0 \ 0, 0, varu')
matrix list Chat_IFGLS1
return scalar varv1 = varv1'
return scalar varv2 = varv2'
return scalar varu = varu'
return scalar cov_v1v2 = `cov_v1v2'

end

IFGLS1
* I think I made a mistake when altering the code because it is not converging as nicely as it was before.

********************
* I will design a simulation that repeatedly generates the data and runs both estimators

cap program drop sim1
program define sim1, rclass

gen_data

timer clear 1
timer on 1

ml model lf myNormalReg (y = x1 x2) (sigma2: varu x1_2 x2_2 x12, noconstant)
ml maximize, iter(100)
* I set the maximum number of iterations to 100 (parrellel to that of the IFGLS)

timer off 1

return scalar MLE_Bx1 = [eq1]_b[x1]
return scalar MLE_Bx2 = [eq1]_b[x2]

return scalar MLE_varv1 = [sigma2]_b[x1_2]
return scalar MLE_varv2 = [sigma2]_b[x2_2]
return scalar MLE_cov_v1v2 = [sigma2]_b[x12]/2

timer list 1
return scalar MLE_time = r(t1)

timer clear 2
timer on 2
IFGLS1
timer off 2

return scalar GLS_Bx1 = _b[x1]
return scalar GLS_Bx2 = _b[x2]

return scalar GLS_varv1 = r(varv1)
return scalar GLS_varv2 = r(varv2)
return scalar GLS_cov_v1v2 = r(cov_v1v2)

timer list 2
return scalar GLS_time = r(t2)

* Sigma2 means that I will not take the square root of sigma when calculating weighting matrix
timer clear 3
timer on 3
IFGLS1 sigma2
timer off 3

return scalar GLS2_Bx1 = _b[x1]
return scalar GLS2_Bx2 = _b[x2]

return scalar GLS2_varv1 = r(varv1)
return scalar GLS2_varv2 = r(varv2)
return scalar GLS2_cov_v1v2 = r(cov_v1v2)

timer list 3
return scalar GLS2_time = r(t3)

end

sim1

* Now let's run the simulation:
simulate MLE_Bx1=r(MLE_Bx1) MLE_Bx2=r(MLE_Bx2) MLE_varv1=r(MLE_varv1) MLE_varv2=r(MLE_varv2) MLE_cov_v1v2=r(MLE_cov_v1v2) MLE_time=r(MLE_time) ///
GLS_Bx1=r(GLS_Bx1) GLS_Bx2=r(GLS_Bx2) GLS_varv1=r(GLS_varv1) GLS_varv2=r(GLS_varv2) GLS_cov_v1v2=r(GLS_cov_v1v2) GLS_time=r(GLS_time) ///
GLS2_Bx1=r(GLS2_Bx1) GLS2_Bx2=r(GLS2_Bx2) GLS2_varv1=r(GLS2_varv1) GLS2_varv2=r(GLS2_varv2) GLS2_cov_v1v2=r(GLS2_cov_v1v2) GLS2_time=r(GLS2_time) , reps(5): sim1

* Remember our targets:
* b1 = 2
* b2 = -2
* matrix C = (5, 1 , 0 \ 1, 3.5, 0 \ 0, 0, 400)
* The estimators seem practically identical at this level.
sum

* These first 5 simulations are just a sample run.  We need to simulate at least 100 times to hope to have power enough to evaluate the difference between estimators.

simulate MLE_Bx1=r(MLE_Bx1) MLE_Bx2=r(MLE_Bx2) MLE_varv1=r(MLE_varv1) MLE_varv2=r(MLE_varv2) MLE_cov_v1v2=r(MLE_cov_v1v2) MLE_time=r(MLE_time) ///
GLS_Bx1=r(GLS_Bx1) GLS_Bx2=r(GLS_Bx2) GLS_varv1=r(GLS_varv1) GLS_varv2=r(GLS_varv2) GLS_cov_v1v2=r(GLS_cov_v1v2) GLS_time=r(GLS_time) ///
GLS2_Bx1=r(GLS2_Bx1) GLS2_Bx2=r(GLS2_Bx2) GLS2_varv1=r(GLS2_varv1) GLS2_varv2=r(GLS2_varv2) GLS2_cov_v1v2=r(GLS2_cov_v1v2) GLS2_time=r(GLS2_time) , reps(100): sim1
sum

* 1. Feasiblility of estimates (are the estimates consistent with the model and or are the estimates estimatable - no failure of convergence)

* First let's test the feasibility of the estimates

gen MLEfeasible = (MLE_varv1>0) * (MLE_varv2>0) * (MLE_varv1*MLE_varv1-MLE_cov_v1v2^2>0)
gen GLSfeasible = (GLS_varv1>0) * (GLS_varv2>0) * (GLS_varv1*GLS_varv1-GLS_cov_v1v2^2>0)
gen GLS2feasible = (GLS2_varv1>0) * (GLS2_varv2>0) * (GLS2_varv1*GLS2_varv1-GLS2_cov_v1v2^2>0)

sum *feasible
ttest MLEfeasible=GLSfeasible
* As far as feasibility goes there is no difference in the feasibility of MLE or GLS.
* Both methods produce PD estimates at a rate around 30%

* 2. Unbiasedness of estimates
ttest MLE_Bx1 = GLS_Bx1
ttest GLS_Bx1 = GLS2_Bx1
ttest MLE_Bx2 = GLS_Bx2
ttest GLS_Bx2 = GLS2_Bx2

* There is no statistical significance in the difference between coefficient estimates except MLE x1 with GLS at the 10% level but given the number of t-tests this result is not convincing.

* 2. Unbiasedness of estimates
ttest MLE_varv1 = GLS_varv1
ttest GLS_varv1 = GLS2_varv1
ttest MLE_varv2 = GLS_varv2
ttest GLS_varv2 = GLS2_varv2
* There does not appear to be a difference in the variance estimates between the GLS and MLE.

* However, between GLS and GLS2 the variances of GLS2 are statistically smaller than GLS1 which makes GLS1 better for var1 and worse for var2

* Perhaps if the simulation was run 1000 or 10000 times we could get a more precise estimate of the results.

* 3. Statistical efficiency of estimates
sum *x?

* Just eyeballing it, there does not seem to be a substantial difference in the variances of the estimates either.

* 4. Computational efficiency of estimates
sum *time
ttest MLE_time = GLS_time
* The MLE estimator runs statistically significantly faster than the GLS (in this set up at least)
ttest GLS_time = GLS2_time
* GLS runs faster than GLS2 which indicates to me that the original weight adjustment was better than GLS2.

*******************
* Overall

* Both estimators seem to work pretty well at generating unbiased estimates.  The MLE estimator seems to be faster but that could be a byproduct of the sample size and simulation characteristics.

* However, the primary criteria was the ability to generate estimates that were feasible soluations to the problem (that is producing a variance matrix that was positive definite)

* All estimators failed around 70% of the time in this simulation where the correlation between random coefficients is only 24% (rho=1/(5^.5*3.5^.5))

* Thus, I am unhappy with both methods and additional methods will need be investigated.

1. Great work and intersting results, Francis!

I just wanted to mention one thing. It seems to me that, unlike what you reported in the post, your results unambigously show that it's the IFGLS estimator which is the faster:

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
MLE_time | 100 .61314 .9316861 .234 4.61
GLS_time | 100 .19277 .3099843 .031 1.016
GLS2_time | 100 .30592 .3665455 .046 1.109

. ttest MLE_time = GLS_time

Paired t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
MLE_time | 100 .61314 .0931686 .9316861 .4282733 .7980067
GLS_time | 100 .19277 .0309984 .3099843 .1312624 .2542776
---------+--------------------------------------------------------------------
diff | 100 .42037 .0986312 .9863118 .2246643 .6160757
------------------------------------------------------------------------------
mean(diff) = mean(MLE_time - GLS_time) t = 4.2620
Ho: mean(diff) = 0 degrees of freedom = 99

Ha: mean(diff) < 0 Ha: mean(diff) != 0 Ha: mean(diff) > 0
Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 0.0000

Moreover, the difference in the speed of IFGLS and ML becomes even larger when increasing the sample size (try for example obs=10000). Probably ML would be much faster with analytical gradient and Hessian.

Szabolcs

1. 