## Tuesday, February 12, 2013

### Fractional Response Estimation

* Often times, many questions that we would like to answer take the form of fractional response data.

* What percentage of students by school meet international standards, what percentage of losing military commanders are brought up on war crimes charges, or what percentage of those in different cities have of getting the flu.

* Using standard OLS methods without accounting for the natural upper and lower bound may lead to biased and inconsistent estimates (if there are build ups around the upper or lower bounds).

* This is due to corr(u,y)!=0 by the same reasoning as with binary response, censored response, and truncated response data.

* For example:

* Simulation 1

clear
set obs 1000

gen x=rnormal()
gen u=rnormal()
* Unobserved heterogeniety

gen y=normal(0+6*x+u)
label var y "Observed outcome"
* Such a large coefficient on x will drive most of the observations into the extremes.

hist y, frac title(We can see large buildups at 0 and 1)

reg y x
local B_OLS = _b[x]

* Let's see our predicted outcomes
predict y_hat
label var y_hat "Predicted from OLS"

* One of the problems we realize immediately is that it is hard to discuss bias or consistency when we don't really know that the true expected change in y is as a result of a change in x.

* Fortunately, this is easy to calculate.

* y=NormalCDF(B0 + xB + u)
* dy/dx = B * NormalPDF(B0 + xB + u)    ... by chain rule
* dy/dx = 2 * NormalPDF(-2 + x*2 + u)

gen partial_x_y = 6 * normalden(0+x*6 + u)
* Of course each effect is unique to each observation currently.

* To find the average we just average.

sum partial_x_y
* The mean of the partial_x_y shows us that the OLS estimator is too large.
local B = r(mean)

* To find the "true" linear projection of x on y we need first to identify the B0 coefficient.

sum y
local y_bar = r(mean)

sum x
local x_bar = r(mean)

local B0 = `y_bar'-(`x_bar'*`B')

gen true_XB = `B0' + `B'*x
label var true_XB "Predicted from 'true' LM"

gen linear_u = y-true_XB

corr linear_u x
* We can see that the linear error terms are correlated with x, which is potentially what we expect if x has a positive coefficient.

* This is because as x gets large (approaches infitity for example) then the max(linear_u) must approach 0 since x is driving y to be 1 and u cannot make y larger than 1.

* The argument is identical for as x gets small.  Thus when x is large E(u)<0 e="" is="" small="" u="" when="" while="" x="">0.

* We can see our OLS estimator is about half that of the true (though when the data was more evenly dispursed it seemed the OLS estimator was working well.)

* A greater concern for many might be not only the bias of the estimates but the feasibility of the predictions.

two (scatter y x, sort) (scatter y_hat x, sort) (line true_XB x, sort)

* We can see that the true Average Partial Effect is steeper, being less vulnerable to being adjusted by the outlierers created by Y being bound.

* This is not to say that the true APE is ignoring that for a large portion of the observations changes in x has very little effect on y.

* Only that OLS assumes a particular relationship in the errors which is violated.

* Alternatively we can use a estimator that better fits our data generating process.

glm y x, family(binomial) link(probit) robust

predict y_glm
label var y_glm "Predicted from GLM"

two (scatter y_glm y) (lfit  y_glm y), title("A perfect fit would be a line that goes from 0,0 to 1,1")

* This looks pretty good.

two (scatter y x, sort) (scatter y_hat x, sort) (line y_glm x, sort col(red))
* The fitted data is much more satisfying as well.

* Notice though, that even if you ignored the fractional responses and just looked at the binary responses, a probit model would probably fit pretty well.

* We should be able to estimate the marginal effects with the margins command
margins, dydx(x)
di "True APE = `B'"
di "OLS APE = `B_OLS'"

* Unfortunately, it does not appear that the glm command is doing worse than the OLS estimator at estimating the APEs.

* This is probably because of the model misspecification.  That is, when generating the data, the error term should not be a separate term within the normalCDF.

* However, I am not sure how best to do this otherwise.  If anybody has any ideas I am open to them.

* An alternative might be the following:

* Simulation 2

clear
set obs 1000

gen x=rnormal()

gen yp=normal(3*x+2)

gen ob = rpoisson(100)
label var ob "Number of individuals"

gen y=rbinomial(ob,yp)/ob
label var y "Observed outcome"

* This formulation fits better into our concept of how fractional response data works.

* That is there is some number of individual units (in this case 100) which each are being evaluated for either 0 or 1 outcomes.

* By dividing by 100 we reduce the data to a fractional response problem which could be used to evaluate different sized groups since the fractional reponse maintains the same scale.

hist y, frac title(Fractional Responses Look Very Similar)

* Perhaps a little less build up on the tails.

reg y x
local B_OLS=_b[x]

* Let's see our predicted outcomes
predict y_hat
label var y_hat "Predicted from OLS"

* Now to esimate the partial effect we use the same formulation as before because
* for the binomial E(y|n,p)=np
* But we transformed y=E(y/n|n,p)=np/n=p
* Thus:
* y=NormalCDF(xB)              ... but now we do not have to worry about the error u.
* dy/dx = B * NormalPDF(xB)    ... by chain rule
* dy/dx = 2 * NormalPDF(x*2)

gen partial_x_y = 3 * normalden(x*3+2)

* To find the average we just average.

sum partial_x_y
local B = r(mean)
* We can see that the OLS regressor now is too large.

sum y
local y_bar = r(mean)

sum x
local x_bar = r(mean)

local B0 = `y_bar'-(`x_bar'*`B')

gen true_XB = `B0' + `B'*x
label var true_XB "Predicted from 'true' LM"

gen linear_u = y-true_XB

corr linear_u x

two (scatter y x, sort) (line y_hat x, sort) (line true_XB x, sort)

* The only problem with generating data this way is that our data looks too perfect.
* How do we add greater variability into such a formulation?
* A method many might think of would be to add some kind of fixed effect per observation.
* This would suggest that the individuals within each observation are more or less likely to participate based on some unobserved factor which is unique to that observation.
* This is probably not a stretch to think about: People in a city are more likely to get the flu if the flue is already common in the city, Student are more likely to pass an exam if the school is particularly strong or attracts strong students, etc.
* However, if this is the case then the previous data generating process we discussed is more appropriate since u is in effect an unobserved observation level effect.
* In the extreme, as the number of individuals gets large the binomial formulation converges on having the response be the pdf (by the law of large numbers).

glm y x, family(binomial) link(probit) robust
di "Notice the standard errors are about "  e(dispers_p)^.5 " fraction of the those if there were a single draw of the binomail"
* Which is appropriate since there are one average 100 draws, thus the estimated standard error should be about 100 times too large

predict y_glm

margins, dydx(x)
di "True APE = `B'"
di "OLS APE = `B_OLS'"
* We can see that now the glm estimator is working better.

* A researcher might be instead tempted to estimate total raw responses as a function of x via OLS.

* In order to do this the researcher would not calculate the fraction of responses.

gen yraw = y*ob
hist yraw

* The data does not look like it would be crazy to estimate some kind of truncated regression (tobit for example).

* However, given how we know the data has been generated, this would be a mistake.

* The zeros in the data are not truncated values but rather meant to be zeros.

tobit yraw x, ll(0)
* We can adjust the results to our previous scale by dividing the coefficient by 100.

di "Our censored regression estimate is: " _b[x]/100

* This is larger than our previous estimate and biased because it is making the assumption that if our data were not bottom coded at 0 then we would have actually observed negative values for y.

* This is clearly impossible.

* We can further aggravate our mistake by specifying an upper limit as well.

* However, we will make this mistake looking at the original ratios:
tobit y x, ll(0) ul(1)
predict y_tobit

* This yeilds a large and even more biased estimate.

two (scatter y x) (line y_tobit x)

* Graphically it is easy to see how strong the assumptions would need be to justify use of a censored tobit.

* It is similar to what would have been estimated if we just dropped all values that were equal to 0 or 1.

reg y x if y>0 & y<1 p="">