* Hot decking is a method commonly used in statistics to impute values where data are missing.
* Let's see how it works!
* Imagine we have a data set of 200,000 people.
* Some of the questions were not answered by those people but most people did answer the majority of the questions.
set seed 1010
clear
set obs 200000
gen male = rbinomial(1,.51)
gen age = rpoisson(40)
gen education = rpoisson(12)
gen legality = rbinomial(1,.15)
gen parents_assets = rpoisson(4)
gen social_network = rbinomial(4, .1)
* Number of close friends
gen race = ceil(runiform()*4)
* Let's say there are 4 "races" of approximately equal representation
* Note that if we are hot decking across all of these characteristics, the number of potential hot decks is the product of the number of values each variable can take: 2 for gender, 2 for legality, 5 for social network (0 to 4 close friends), and 4 for race, so 2*2*5*4 = 80, without taking into account age (mean 40, variance 40), education, or parents_assets (mean 4, variance 4, thus sd 2).
* Since the Poisson distribution has no upper limit, age, education, and parents_assets could present a problem, but in practice there is a low probability of draws more than about two standard deviations from the mean, as the Poisson begins to look very much like a discrete normal distribution as its mean gets large.
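* As a quick illustration, we could count how many cells the discrete variables alone actually populate (grp_check is just a throwaway variable for this check):
egen grp_check = group(male legality social_network race)
qui sum grp_check
di "Populated decks over the discrete variables: " r(max)
drop grp_check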
* Now let's generate the outcome variable we are trying to predict, along with an unobserved error term.
gen u = 2*rpoisson(40)
gen earnings = male + .01*age + .3*education + .1*social_network + legality + parents_assets + race + u
* The ideal case would be if we could run the OLS on the fully observed data.
reg earnings male age education social_network legality parents_assets race
* However, in actuality, some of our data is missing.
foreach v in male age education legality parents_assets race social_network {
* There is a 1/16 chance of the value being missing
gen miss = rbinomial(1,`=1/16')
replace `v' = . if miss==1
drop miss
}
* One way of handling this would be to do our estimation using only the complete cases, dropping any observation with missing data.
reg earnings male age education social_network legality parents_assets race
/*
Source | SS df MS Number of obs = 127112
-------------+------------------------------ F(7, 127104) = 772.07
Model | 867810.062 7 123972.866 Prob > F = 0.0000
Residual | 20409480.7 127104 160.57308 R-squared = 0.0408
-------------+------------------------------ Adj R-squared = 0.0407
Total | 21277290.8 127111 167.39142 Root MSE = 12.672
------------------------------------------------------------------------------
earnings | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
male | .8483197 .0711003 11.93 0.000 .7089644 .987675
age | .0118882 .0056213 2.11 0.034 .0008705 .022906
education | .315772 .0102739 30.74 0.000 .2956353 .3359086
social_net~k | .0225761 .0592116 0.38 0.703 -.0934775 .1386298
legality | 1.055723 .0992171 10.64 0.000 .8612589 1.250187
parents_as~s | 1.01567 .0177606 57.19 0.000 .9808596 1.05048
race | .9607688 .031791 30.22 0.000 .8984591 1.023079
_cons | 79.86437 .2833975 281.81 0.000 79.30892 80.41982
------------------------------------------------------------------------------
*/
* We can see that although each individual explanatory variable is only missing about 1/16 of its values, because different people are missing different values across all of our explanatory variables, we see a substantial drop from our original 200,000 observations: a sizable share of people failed to answer at least one of the questions.
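* A quick back-of-the-envelope check: with 7 explanatory variables each missing independently with probability 1/16, the expected share of complete cases is (15/16)^7, or roughly 64%.
di "Expected share of complete cases: " (15/16)^7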
* So we are going to try to impute our missing values to strengthen our estimation power.
* There is a user-written command for hot decking in Stata (http://ideas.repec.org/c/boc/bocode/s366901.html), but here we will code up a simple hot deck by hand.
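* (If you would rather use the packaged version, it is available from SSC; I believe the package is called hotdeck.)
* ssc install hotdeck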
* We will need to temporarily recode Stata's missing values to a numeric placeholder.
sum
recode male age education legality parents_assets race social_network (.=-9999)
* This is a somewhat tricky bit of code; I believe it is working correctly, but I may easily be mistaken.
foreach v in male age education legality parents_assets race social_network {
* I want to create a list of by-variables that excludes the current loop variable.
local byvars = subinstr("male age education legality parents_assets race social_network", "`v'", "",.)
* Create an initial group identifier for each potential hot deck.
qui egen grp = group(`byvars')
qui sum grp
di "For variable `v' we have " r(max) " hot decks"
* Now let's work out how many potential values there are to choose from in each hot deck; first, flag the missing observations.
qui gen missing = 1 if `v' == -9999
* Count the number of missings in each group
bysort grp: egen missing_count = sum(missing)
* Count the number of items in each group
bysort grp: gen all_count = _N
* The hot deck size is the number of observations in the deck less the number of missing observations.
bysort grp: gen draw_deck = all_count - missing_count
* Finally we need to figure out where each group starts within the overall sorted data.
sort grp
qui gen n = _n
* This is the best trick I have right now for specifying where the group "starts" in the vertical distribution.
* I.e., the minimum of the _n values within a group is that group's starting position in the sorted list.
bysort grp: egen pos_min = min(n)
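* As a quick check on this trick, an equivalent one-liner gives the same starting position (pos_min2 is just a temporary variable for the comparison).
qui bysort grp (n): gen pos_min2 = n[1]
assert pos_min == pos_min2
drop pos_min2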
* Pick which "card" to draw for each missing value: a random offset into the group's observed values, added to the group's starting position.
* We add missing_count because the -9999 records will sort to the top of each group in the next step; decks with no observed values (draw_deck == 0) are left alone and simply stay missing.
qui gen replace_card = floor(draw_deck*runiform()) + pos_min + missing_count if `v' == -9999 & draw_deck > 0
* Now we have to make sure our data is arranged properly: sorting by grp and `v' puts the -9999 records first within each group, with the observed values starting at position pos_min + missing_count.
sort grp `v'
* Replace our `v' value with the card drawn from the relevant hot deck
qui replace `v' = `v'[replace_card] if `v' == -9999 & replace_card < .
drop grp missing missing_count all_count draw_deck n pos_min replace_card
}
recode male age education legality parents_assets race social_network (-9999=.)
sum
reg earnings male age education social_network legality parents_assets race
/*
Source | SS df MS Number of obs = 172180
-------------+------------------------------ F( 7,172172) = 918.86
Model | 1033602.53 7 147657.505 Prob > F = 0.0000
Residual | 27667352 172172 160.696002 R-squared = 0.0360
-------------+------------------------------ Adj R-squared = 0.0360
Total | 28700954.6 172179 166.692538 Root MSE = 12.677
------------------------------------------------------------------------------
earnings | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
male | .9096129 .0611052 14.89 0.000 .789848 1.029378
age | .0133704 .0050211 2.66 0.008 .0035292 .0232116
education | .3026236 .0091769 32.98 0.000 .2846371 .3206102
social_net~k | .0415584 .0530193 0.78 0.433 -.0623583 .1454752
legality | .97673 .0899224 10.86 0.000 .8004842 1.152976
parents_as~s | .9697 .0158518 61.17 0.000 .9386308 1.000769
race | .9613812 .027394 35.09 0.000 .9076896 1.015073
_cons | 80.19244 .2510821 319.39 0.000 79.70032 80.68456
------------------------------------------------------------------------------
*/
* The first thing to notice is that we have increased the number of usable observations to about 172K, up from roughly 127K in the first regression, a gain of about 35%.
* If the code is working properly, the reason I believe we have not regained our full 200K observations is that many hot decks are populated only by missing values.
* Thus the hot-decking algorithm has nothing to draw from.
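* We can check this directly by counting how many values are still missing after imputation.
foreach v in male age education legality parents_assets race social_network {
qui count if `v' == .
di "`v' still missing: " r(N)
}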
* I have been told that hot decking differs from classical measurement error in that it does not lead to attenuation bias, which is interesting but also problematic.
* Observe the t values and rejection rates in the second estimation compared with the first.
* Uniformly the t-values are getting larger despite the estimates not always getting better.
* This is because some of the regressors are imputed and therefore cannot be trusted in the same fashion as standard, fully observed regressors; in particular, the reported standard errors do not account for the uncertainty introduced by the imputation.