## Friday, November 9, 2012

### Estimating Random Coefficients on X

Original Code

* A frequent assumption in economics is that the coefficient that is being estimated is constant throughout the population.
* That is, the effect of the exogenous variable is the same for all individuals.
* This assumption is implicit in the notation:

* Y = XB + u
* Because B is assumed to be constant for all individuals.
* However, let's for one second imagine that we are not estimating the constant B, but rather a random variable b.
* Thus: Y = Xb + u

* This immediately presents an obvious problem.

* What do we want to know from this formulation?
* One problem is that we have as many bs as observations, so we cannot hope to estimate the individual b values.

* Where most people have taken this model is to say we are interested in two primary things:
* 1. What is the average effect of X on Y?  I.e. E(dY/dX|X), which is what we are really after when assuming a constant coefficient.
* 2. And how much variance exists in b?

* b = B + v
* Where B is the average coefficient and v is the random component.

* Now to insert this into the estimation equation:

* Y = Xb + u = X(B + v) + u = XB + Xv + u = XB + e
* Now, we can easily show that given E(v|X) = E(u|X) = 0, OLS is unbiased.
* Bhat = (X'X)^-1 X'Y = (X'X)^-1 X'(XB + e) = (X'X)^-1 X'XB + (X'X)^-1 X'e
* Bhat = B + (X'X)^-1 X'e = B + (X'X)^-1 X'(Xv + u) = B + v + (X'X)^-1 X'u
* E(Bhat|X) = E(B + v + (X'X)^-1 X'u|X) = B + E(v|X) + (X'X)^-1 X'E(u|X) = B

* However, OLS is no longer the most efficient estimator of B (because the homoskedasticity assumption is violated).
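Before turning to Stata, here is a quick numeric spot-check of the algebra above as a Python sketch (the numbers mirror the Stata simulation that follows: B = 4 and the same covariance matrix C):

```python
import numpy as np

# Spot-check: with a random coefficient b_i = B + v_i, OLS still recovers
# the average coefficient B on average, even when v and u are correlated.
rng = np.random.default_rng(121)
n = 200_000
x = rng.uniform(0, 5, n)
C = np.array([[9.0, 2.5], [2.5, 23.0]])   # var(v), cov(v,u) \ cov(v,u), var(u)
v, u = rng.multivariate_normal([0.0, 0.0], C, size=n).T

y = 4 * x + v * x + u                      # true average coefficient B = 4
X = np.column_stack([np.ones(n), x])
bhat = np.linalg.lstsq(X, y, rcond=None)[0]
print(bhat[1])  # close to 4, despite the random coefficient
```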

clear
set obs 10000
set seed 121

gen x = runiform()*5

* I want v and u to be drawn from a multivariate normal, which allows them to be correlated.
matrix C = (9, 2.5 \ 2.5, 23)

drawnorm v u, cov(C)

gen y = 4*x + v*x + u

* We can see OLS works "fine"
reg y x

predict uhat, resid
scatter uhat x, sort title(Heteroskedastic Normally Distributed Error)
* This is a form of heteroskedasticity.

* So, we would like to do both 1 and 2 above.
* We would like to estimate not only the average effect of x on y (which we can accomplish with OLS),
* but also the variances of the two error components, u and v.

* In order to do that, we will first define myNormalReg.
* That is, the regression in MLE form, assuming the errors are normally distributed.

cap program drop myNormalReg
program define myNormalReg

* The first argument of any maximum likelihood program is the name of the temporary log-likelihood variable created by Stata when the ml procedure is called.
* Each additional argument is a linear "equation" whose parameters Stata chooses so as to maximize the likelihood.
args lnlk xb sigma2
* Thus we have two equations that we are asking Stata to solve.

* The following is the log likelihood value that we would like Stata to maximize.
* In this case it is the log of the normal density, asking Stata to choose the optimal mean and variance.
qui replace `lnlk' = -ln(sqrt(`sigma2') * sqrt(2*_pi)) - ($ML_y1-`xb')^2/(2*`sigma2')
* I have opted to program the pdf of the normal directly rather than use the built-in normal density function.

* Notice that unlike my previous post on modelling heteroskedasticity (http://www.econometricsbysimulation.com/2012/11/modeling-heteroskedasticity.html)
* I am using sigma squared rather than sigma as the primary form of the variance to be modelled.
* This choice is not trivial.
end

* The first thing you are probably wondering is: why use a model with only one error term when we really want to learn about two error terms?

* The answer is that, as seen above, the errors v and u can be combined into a single error term e.

* Thus the variance of e is: var(e) = var(u) + var(v)*x^2 + 2*cov(v,u)*x

* This is conveniently a linear form that can easily be included in the above equation.
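The variance formula can itself be checked by simulation; a minimal Python sketch, using the same covariance matrix C as the Stata code:

```python
import numpy as np

# For e = x*v + u at a fixed x, the empirical variance of e should match
# var(u) + var(v)*x^2 + 2*cov(v,u)*x.
rng = np.random.default_rng(121)
C = np.array([[9.0, 2.5], [2.5, 23.0]])   # var(v)=9, cov(v,u)=2.5, var(u)=23
v, u = rng.multivariate_normal([0.0, 0.0], C, size=1_000_000).T

x = 3.0
e = x * v + u
empirical = e.var()
formula = 23.0 + 9.0 * x**2 + 2 * 2.5 * x  # var(u) + var(v)*x^2 + 2*cov(v,u)*x
print(empirical, formula)  # the two agree to within simulation error
```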

* First we need to generate x2 and a constant regressor varu
gen x2 = x^2
gen varu = 1

ml model lf myNormalReg (y = x) (sigma2: varu x2 x, noconstant)
ml maximize

matrix list C

* Looking back at matrix C, we can see that our coefficient on varu corresponds well with the variance of u,
* and our coefficient on x2 approximates the variance of v.
* But our coefficient on x is much too large to be the covariance term.
* This is expected: from the formulation of var(e), the coefficient on x should be twice cov(v,u).

local varu = [sigma2]_b[varu]
local varv = [sigma2]_b[x2]
local cov_uv = [sigma2]_b[x]/2

matrix Chat = (`varv', `cov_uv' \ `cov_uv', `varu')

matrix list Chat
matrix list C

* In order for the MLE estimator to be efficient in this case, we are assuming that both v and u are normally distributed (which they are, because we generated them).
* We may also be concerned that the estimated covariance matrix Chat is not positive semi-definite, a requirement for v and u to be jointly multivariate normally distributed.

* We can at least check that Chat is nonsingular by attempting to take the inverse:
matrix INVA = inv(Chat)
matrix list INVA

* In this case there was no problem inverting Chat.  But this need not be the case.
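Note that invertibility alone is a weak check: a matrix with a negative determinant still inverts without complaint. For a symmetric 2x2 matrix, positive definiteness is equivalent to a positive (1,1) entry and a positive determinant (Sylvester's criterion), which is the check the PDcheck program below uses. A small Python illustration (the second matrix is a made-up non-PD example):

```python
import numpy as np

# Sylvester's criterion for a symmetric 2x2 matrix: PD iff the (1,1) entry
# and the determinant are both positive.
def is_pd(M):
    return bool(M[0, 0] > 0 and np.linalg.det(M) > 0)

Chat_ok  = np.array([[9.0, 2.5], [2.5, 23.0]])  # the matrix used in this post
Chat_bad = np.array([[2.0, 5.0], [5.0, 11.0]])  # made-up: det = 22 - 25 < 0

print(is_pd(Chat_ok), is_pd(Chat_bad))  # True False
print(np.linalg.inv(Chat_bad))          # inverts fine despite not being PD
```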

* Let's try simulating some outcomes for a matrix a hair's breadth away from not being PSD.

cap program drop PDcheck
program define PDcheck, rclass
clear
set obs 1000

gen x = runiform()*5
matrix C = (2, 4 \ 4, 11)

drawnorm v u, cov(C)

gen y = 4*x + v*x + u
gen x2 = x^2
gen varu = 1

ml model lf `1' (y = x) (sigma2: varu x2 x, noconstant)
ml maximize

return scalar beta_coef = [eq1]_b[x]

local varu = [sigma2]_b[varu]
return scalar varu = `varu'
local varv = [sigma2]_b[x2]
return scalar varv = `varv'
local cov_uv = [sigma2]_b[x]/2
return scalar cov_uv = `cov_uv'

local PD = 0
if (`varu'*`varv' - `cov_uv'^2 >0)&(`varu'>0)&(`varv'>0) local PD = 1
return scalar PD=`PD'

end

PDcheck myNormalReg
return list

simulate PD=r(PD) beta_coef=r(beta_coef) varu=r(varu) ///
varv=r(varv) cov_uv=r(cov_uv), rep(50): PDcheck myNormalReg
sum
* We can see that in this formulation the estimated covariance matrix is PD only about half the time.

* This could present a problem.
* There are various ways of trying to correct for this.
* All of them require more advanced programming with the ml syntax than I currently understand.
* I hope/need to learn how to make such corrections in the near future.

* Note, Stata is already doing some clever behind the scenes manipulations to make sure only feasible values of the parameters are chosen.

* Stata must ensure that sigma2 >= 0 (because it is taking the square root).

cap program drop myNormalReg2
program define myNormalReg2

* The first argument of any maximum likelihood program is the name of the temporary log-likelihood variable created by Stata when the ml procedure is called.
* Each additional argument is a linear "equation" whose parameters Stata chooses so as to maximize the likelihood.
args lnlk xb var_u var_v cov_uv

tempvar sigma2 psd_check
gen `sigma2' = (exp(`var_u') + exp(`var_v') + 2*`cov_uv')

gen `psd_check' = (`var_u'*`var_v' - `cov_uv'^2)

qui replace `lnlk' = -ln(sqrt(`sigma2') * sqrt(2*_pi)) - ($ML_y1-`xb')^2/(2*`sigma2') * ///
(1^sqrt(`psd_check')) * (1^sqrt(`var_u'))
* The 1^sqrt() terms equal 1 when their arguments are non-negative but evaluate to missing otherwise, which penalizes infeasible parameter values.
* I have opted to program the pdf of the normal directly rather than use the built-in normal density function.

* Notice that unlike my previous post on modelling heteroskedasticity (http://www.econometricsbysimulation.com/2012/11/modeling-heteroskedasticity.html)
* I am using sigma squared rather than sigma as the primary form of the variance to be modelled.
* This choice is not trivial.
end

ml model lf myNormalReg2 (y = x) (var_u: ) (var_v: x2, noconstant) (cov_uv: x, noconstant)
ml maximize
* Note that because I was able to specify sigma2 directly, I have already multiplied the covariance term by the constant 2, so the cov estimate is scaled properly.

* This new specification is guaranteed to produce a positive-definite covariance matrix.
* However, it frequently fails to converge, which means it is really not a very good algorithm.
* I will continue to experiment with this problem in the near future.

* The next step is to use a Cholesky decomposition to parameterize the variance matrix (as suggested by my advisor Jeff Wooldridge).
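The Cholesky idea can be sketched in a few lines (a hypothetical Python illustration, not the eventual Stata implementation): estimate the free entries of a lower-triangular L instead of the covariance entries themselves, and form C = LL', which is positive semi-definite by construction for any real L.

```python
import numpy as np

# Instead of estimating (var_v, cov_uv, var_u) directly, estimate the three
# free entries of a lower-triangular L and form C = L L', which is positive
# semi-definite for any choice of entries.
def cov_from_cholesky(l11, l21, l22):
    L = np.array([[l11, 0.0], [l21, l22]])
    return L @ L.T

# The entries reproducing this post's C = (9, 2.5 \ 2.5, 23):
C = cov_from_cholesky(3.0, 2.5 / 3.0, np.sqrt(23.0 - (2.5 / 3.0) ** 2))
print(np.round(C, 6))

# Even arbitrary (e.g. negative) entries yield a valid PSD matrix:
eigs = np.linalg.eigvalsh(cov_from_cholesky(-1.7, 0.3, 0.0))
print(eigs.min() >= -1e-12)  # True
```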

* I will post more on that as I make steps forward.

* Also, if anybody knows any built in commands in Stata that can do this, please post them.

* I have used xtmixed previously to identify random coefficients.

* However, I think the command imposes orthogonality on the random coefficients.

1. Hi Francis,

congratulations on your blog - it's very interesting and useful. I'd like to comment on this post.

1. I think what you have formulated is a Generalized Least Squares model of heteroscedasticity. GLS can be estimated not only by Maximum Likelihood but also by Feasible GLS which starts by OLS and then estimates the error variance from the OLS residuals and estimates a weighted least squares model. When iterated until convergence, FGLS is ML. I copied below a code that compares your ML estimator with the IFGLS method (only the first part can go into this post, I'll copy the second part into the next, just put them into one Stata do file). The advantage of the latter is that it's a (sequence of) linear estimator(s) and it's probably faster. Note that I tried to deal with the possible non-PSD issue by not updating the weighting vector for the problematic observations. (Though this issue never arises in the example - probably a further advantage of the IFGLS.)

2. I think in terms of cross-variances it would be more interesting to think of correlations between random terms on coefficients of different observed regressors. Similarly to, say, discrete choice models like BLP, it might be of interest to see if "preferences" for different "characteristics" (different columns of x when there are more regressors) are correlated or not. This is probably more relevant than the correlation between the random terms and the unobserved error term, as it can tell something about the relationship between the observed variables. Of course, I guess identification depends on whether one has enough variation in those variables.

3. It seems to me that there is a problem in your second estimator (myNormalReg2). In the line you define the variance (gen `sigma2' = (exp(`var_u') + exp(`var_v') + 2*`cov_uv')), the expression is not a function of x and x-square.

I hope you find my comments useful.

Cheers,

Szabolcs Lorincz

************************************************

clear
clear matrix
clear mata
set more off
set mem 3g
set obs 100000
set seed 121

gen x = runiform()*5
matrix C = (9, 2.5 \ 2.5, 23)
drawnorm v u, cov(C)
gen y = 4*x + v*x + u
gen x2 = x^2
gen varu = 1

/* ML estimation */
matrix Chat_ML=J(1,1,0)
matrix dif_ML=J(1,1,0)
timer clear 1
timer on 1
cap program drop myNormalReg
program define myNormalReg
args lnlk xb sigma2
qui replace `lnlk' = -ln(sqrt(`sigma2') * sqrt(2*_pi)) - ($ML_y1-`xb')^2/(2*`sigma2')
end
ml model lf myNormalReg (y = x) (sigma2: varu x2 x, noconstant)
ml maximize
local varu = [sigma2]_b[varu]
local varv = [sigma2]_b[x2]
local cov_uv = [sigma2]_b[x]/2
matrix Chat_ML = (`varv', `cov_uv' \ `cov_uv', `varu')
matrix dif_ML=Chat_ML-C
timer off 1
quietly timer list 1
local t1=r(t1)
local timerm=floor(`t1'/60)
local timers=round(`t1'-floor(`t1'/60)*60)
display "`timerm'm `timers's"

2. Second part of Stata code (ML vs. IFGLS)

/* iterated feasible GLS estimation */
matrix Chat_IFGLS=J(1,1,0)
matrix dif_IFGLS=J(1,1,0)
timer clear 1
timer on 1
local quiet quietly
capture drop weight
generate weight=1
local x=0
local _cons=0
local varv=0
local varu=0
local cov_uv=0
local difference=1
local tolerance=0.0000001
local maxit=1000
local i=0
while (`difference'>`tolerance' & `i'<`maxit') {
local i=`i'+1
`quiet' reg y x [aweight=weight]
local dx=abs(`x'-_b["x"])
local d_cons=abs(`_cons'-_b["_cons"])
local x=_b["x"]
local _cons=_b["_cons"]
capture drop res
predict res, res
capture drop res2
generate res2=res^2
`quiet' reg res2 x2 x varu, noconstant
local dvarv=abs(`varv'-_b["x2"])
local dvaru=abs(`varu'-_b["varu"])
local dcov_uv=abs(`cov_uv'-_b["x"])
local varv=_b["x2"]
local varu=_b["varu"]
local cov_uv=_b["x"]
local difference=max(`dx', `d_cons', `dvarv', `dvaru', `dcov_uv')
capture drop fit
predict fit, xb
`quiet' replace weight=1/(fit^0.5) if fit>0 & fit!=.
quietly count if fit<=0 | fit==.
if (r(N)==0) {
display "FGLS iteration `i', max. abs. difference in parameters: " %7.5f = `difference'
}
if (r(N)>0) {
display "FGLS iteration `i', max. abs. difference in parameters: " %7.5f = `difference' ", # of obs. with inconsistent weight: " r(N) " (" 100*r(N)/_N "%)"
}
}
reg y x [aweight=weight]
local cov_uv=`cov_uv'/2
matrix Chat_IFGLS = (`varv', `cov_uv' \ `cov_uv', `varu')
matrix dif_IFGLS=Chat_IFGLS-C
timer off 1
quietly timer list 1
local t1=r(t1)
local timerm=floor(`t1'/60)
local timers=round(`t1'-floor(`t1'/60)*60)
display "`timerm'm `timers's"

display "True variance-covariance matrix"
matrix list C
display "ML estimate"
matrix list Chat_ML
display "IFGLS estimate"
matrix list Chat_IFGLS
display "Difference of ML estimate from true VC matrix"
matrix list dif_ML
display "Difference of IFGLS estimate from true VC matrix"
matrix list dif_IFGLS

3. Hi Szabolcs Lorincz,

Thanks for this thorough comment.

1. I have run your code and it definitely seems to be working in a superior fashion to the MLE. For some reason I thought that the MLE estimator was supposedly more efficient than the FGLS estimator. However, I seem to be wrong. I will follow up this post with another post simulating the results of both methods to see which one is less biased and has lower variance. It seems clear to me that the FGLS estimator definitely is faster.

2. I think that is a good point about the correlation between the random coefficients of variables.

3. I don't think I am mistaken about including the x and x2 terms in myNormalReg2. I included them in the ml lf syntax in order to maximize compatibility of the code.

ml model lf myNormalReg2 (y = x) (var_u: ) (var_v: x2, noconstant) (cov_uv: x, noconstant)

By specifying directly x2 and x into the above equations it tells Stata to include them in the estimation. The above must be specified or else the MLE would not be able to converge.

4. I have a few more thoughts on the matter.

I don't think IFGLS is more efficient (I know you did not say that but I suggested it) and as for the MLE being GLS, that is not the case.

From Wooldridge's MIT Press book, Chapter 7: GLS is used when the variance-covariance matrix is "known". When it is not known, it can be estimated via methods such as the one you propose, FGLS. MLE is related: it treats both the variance and the coefficients as unknown and uses the normality assumption to estimate them together.

5. On second glance, you are right. I have not included the x2 term and the x term properly in myNormalReg2