* Hot decking is a method commonly used in statistics to impute values where data are missing.
* Let's see how it works!
* Imagine we have a data set of 200,000 people.
* Some of the questions were not answered by those people but most people did answer the majority of the questions.
set seed 1010
clear
set obs 200000
gen male = rbinomial(1,.51)
gen age = rpoisson(40)
gen education = rpoisson(12)
gen legality = rbinomial(1,.15)
gen parents_assets = rpoisson(4)
gen social_network = rbinomial(4, .1)
* Number of close friends
gen race = ceil(runiform()*4)
* Let's say there are 4 "races" of approximately equal representation
* Note that if we are hot decking across all of these characteristics, the number of potential hot decks is the product of the number of values each variable can take: 2 for gender, 2 for legality, 5 for social network (0 to 4 close friends), and 4 for race, so 2*2*5*4 = 80, without taking into account age (mean 40, variance 40), education, or parents_assets (mean 4, variance 4, thus sd 2).
* Since the Poisson distribution has no upper limit, age, education, and parents_assets could present a problem, but in practice there is a low probability of draws more than about two standard deviations from the mean, as the Poisson begins to look very much like a discrete normal distribution as its mean gets large.
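* As a quick illustration, we could count how many cells the discrete variables alone actually populate (grp_check is just a throwaway variable for this check):
egen grp_check = group(male legality social_network race)
qui sum grp_check
di "Populated decks over the discrete variables: " r(max)
drop grp_check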
* Now let's generate the outcome variable we are trying to predict, along with an unobserved error term.
gen u = 2*rpoisson(40)
gen earnings = male + .01*age + .3*education + .1*social_network + legality + parents_assets + race + u
* The ideal case would be if we could run the OLS on the fully observed data.
reg earnings male age education social_network legality parents_assets race
* However, in actuality, some of our data is missing.
foreach v in male age education legality parents_assets race social_network {
* There is a 1/16 chance of the value being missing
gen miss = rbinomial(1,`=1/16')
replace `v' = . if miss==1
drop miss
}
* One way of handling this would be to do our estimation using only the complete cases, dropping any observation with missing data.
reg earnings male age education social_network legality parents_assets race
/*
Source | SS df MS Number of obs = 127112
-------------+------------------------------ F(7, 127104) = 772.07
Model | 867810.062 7 123972.866 Prob > F = 0.0000
Residual | 20409480.7 127104 160.57308 R-squared = 0.0408
-------------+------------------------------ Adj R-squared = 0.0407
Total | 21277290.8 127111 167.39142 Root MSE = 12.672
------------------------------------------------------------------------------
earnings | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
male | .8483197 .0711003 11.93 0.000 .7089644 .987675
age | .0118882 .0056213 2.11 0.034 .0008705 .022906
education | .315772 .0102739 30.74 0.000 .2956353 .3359086
social_net~k | .0225761 .0592116 0.38 0.703 -.0934775 .1386298
legality | 1.055723 .0992171 10.64 0.000 .8612589 1.250187
parents_as~s | 1.01567 .0177606 57.19 0.000 .9808596 1.05048
race | .9607688 .031791 30.22 0.000 .8984591 1.023079
_cons | 79.86437 .2833975 281.81 0.000 79.30892 80.41982
------------------------------------------------------------------------------
*/
* We can see that although each individual explanatory variable is only missing about 1/16 of its values, because different people are missing different values across all of our explanatory variables, we see a substantial drop from our original 200,000 observations: a sizable share of people failed to answer at least one of the questions.
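* A quick back-of-the-envelope check: with 7 explanatory variables each missing independently with probability 1/16, the expected share of complete cases is (15/16)^7, or roughly 64%.
di "Expected share of complete cases: " (15/16)^7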
* So we are going to try to impute our missing values to strengthen our estimation power.
* There is a user-written command for hot decking in Stata (http://ideas.repec.org/c/boc/bocode/s366901.html), but here we will code up a simple hot deck by hand.
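* (If you would rather use the packaged version, it is available from SSC; I believe the package is called hotdeck.)
* ssc install hotdeck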
* We will need to temporarily recode Stata's missing values to a numeric placeholder.
sum
recode male age education legality parents_assets race social_network (.=-9999)
* This is a somewhat tricky bit of code; I believe it is working correctly, but I may easily be mistaken.
foreach v in male age education legality parents_assets race social_network {
* I want to create a list of by-variables that excludes the current loop variable.
local byvars = subinstr("male age education legality parents_assets race social_network", "`v'", "",.)
* Create an initial group identifier for each potential hot deck.
qui egen grp = group(`byvars')
qui sum grp
di "For variable `v' we have " r(max) " hot decks"
* Now let's work out how many potential values there are to choose from in each hot deck; first, flag the missing observations.
qui gen missing = 1 if `v' == -9999
* Count the number of missings in each group
bysort grp: egen missing_count = sum(missing)
* Count the number of items in each group
bysort grp: gen all_count = _N
* The hot deck size is the number of observations in the deck less the number of missing observations.
bysort grp: gen draw_deck = all_count - missing_count
* Finally we need to figure out where each group starts within the overall sorted data.
sort grp
qui gen n = _n
* This is the best trick I have right now for specifying where the group "starts" in the vertical distribution.
* I.e., the minimum of the _n values within a group is that group's starting position in the sorted list.
bysort grp: egen pos_min = min(n)
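* As a quick check on this trick, an equivalent one-liner gives the same starting position (pos_min2 is just a temporary variable for the comparison).
qui bysort grp (n): gen pos_min2 = n[1]
assert pos_min == pos_min2
drop pos_min2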
* Pick which "card" to draw for each missing value: a random offset into the group's observed values, added to the group's starting position.
* We add missing_count because the -9999 records will sort to the top of each group in the next step; decks with no observed values (draw_deck == 0) are left alone and simply stay missing.
qui gen replace_card = floor(draw_deck*runiform()) + pos_min + missing_count if `v' == -9999 & draw_deck > 0
* Now we have to make sure our data is arranged properly: sorting by grp and `v' puts the -9999 records first within each group, with the observed values starting at position pos_min + missing_count.
sort grp `v'
* Replace our `v' value with the card drawn from the relevant hot deck
qui replace `v' = `v'[replace_card] if `v' == -9999 & replace_card < .
drop grp missing missing_count all_count draw_deck n pos_min replace_card
}
recode male age education legality parents_assets race social_network (-9999=.)
sum
reg earnings male age education social_network legality parents_assets race
/*
Source | SS df MS Number of obs = 172180
-------------+------------------------------ F( 7,172172) = 918.86
Model | 1033602.53 7 147657.505 Prob > F = 0.0000
Residual | 27667352 172172 160.696002 R-squared = 0.0360
-------------+------------------------------ Adj R-squared = 0.0360
Total | 28700954.6 172179 166.692538 Root MSE = 12.677
------------------------------------------------------------------------------
earnings | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
male | .9096129 .0611052 14.89 0.000 .789848 1.029378
age | .0133704 .0050211 2.66 0.008 .0035292 .0232116
education | .3026236 .0091769 32.98 0.000 .2846371 .3206102
social_net~k | .0415584 .0530193 0.78 0.433 -.0623583 .1454752
legality | .97673 .0899224 10.86 0.000 .8004842 1.152976
parents_as~s | .9697 .0158518 61.17 0.000 .9386308 1.000769
race | .9613812 .027394 35.09 0.000 .9076896 1.015073
_cons | 80.19244 .2510821 319.39 0.000 79.70032 80.68456
------------------------------------------------------------------------------
*/
* The first thing to notice is that we have increased the number of usable observations to about 172K, up from roughly 127K in the first regression, a gain of about 35%.
* If the code is working properly, the reason I believe we have not regained our full 200K observations is that many hot decks are populated only by missing values.
* Thus the hot-decking algorithm has nothing to draw from.
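* We can check this directly by counting how many values are still missing after imputation.
foreach v in male age education legality parents_assets race social_network {
qui count if `v' == .
di "`v' still missing: " r(N)
}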
* I have been told that hot decking differs from classical measurement error in that it does not lead to attenuation bias, which is interesting but also problematic.
* Observe the t values and rejection rates in the second estimation compared with the first.
* Uniformly the t-values are getting larger despite the estimates not always getting better.
* This is because some of the regressors are imputed and therefore cannot be trusted in the same fashion as standard, fully observed regressors; in particular, the reported standard errors do not account for the uncertainty introduced by the imputation.