* Hot decking is a method commonly used in statistics to impute values when data are missing.

* Let's see how it works!

* Imagine we have a data set of 200,000 people.

* Some people did not answer some of the questions, but most people answered the majority of them.

set seed 1010

clear

set obs 200000

gen male = rbinomial(1,.51)

gen age = rpoisson(40)

gen education = rpoisson(12)

gen legality = rbinomial(1,.15)

gen parents_assets = rpoisson(4)

gen social_network = rbinomial(4, .1)

* Number of close friends

gen race = ceil(runiform()*4)

* Let's say there are 4 "races" of approximately equal representation

* Note that if we are hot decking across all of these characteristics, then the number of potential hot decks is the product of the number of values each discrete variable can take: 2 for male, 2 for legality, 5 for social_network (0 through 4), and 4 for race, so 2*2*5*4 = 80 cells.

* That count leaves out the Poisson variables: age (mean 40, variance 40), education (mean 12), and parents_assets (mean 4, variance 4, thus sd of 2).

* Since the Poisson distribution does not have an upper limit, age, education, and parents_assets could present a problem, but in practice there is a low probability of draws more than about 2 standard deviations from the mean, as the Poisson begins to look very much like a discrete normal distribution as the mean gets large.
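* As a quick sanity check (a throwaway sketch; `cell` is a temporary variable introduced here, not part of the analysis), we can count how many of these discrete cells actually occur once the data are generated:

```stata
* Group observations by the discrete deck-defining variables and
* count how many distinct cells appear in the data
egen cell = group(male legality social_network race)
qui sum cell
di "Nonempty cells: " r(max) " (out of at most 2*2*5*4 = 80)"
drop cell
```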

* Now let's add a noise term for the outcome

gen u = 2*rpoisson(40)

gen earnings = male + .01*age + .3*education + .1*social_network + legality + parents_assets + race + u

* The ideal case would be if we could run the OLS on the complete data

reg earnings male age education social_network legality parents_assets race

/*
      Source |       SS       df       MS              Number of obs =  127112
-------------+------------------------------           F(  7,127104) =  772.07
       Model |  867810.062      7  123972.866          Prob > F      =  0.0000
    Residual |  20409480.7 127104   160.57308          R-squared     =  0.0408
-------------+------------------------------           Adj R-squared =  0.0407
       Total |  21277290.8 127111   167.39142          Root MSE      =  12.672

------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .8483197   .0711003    11.93   0.000     .7089644     .987675
         age |   .0118882   .0056213     2.11   0.034     .0008705     .022906
   education |    .315772   .0102739    30.74   0.000     .2956353    .3359086
social_net~k |   .0225761   .0592116     0.38   0.703    -.0934775    .1386298
    legality |   1.055723   .0992171    10.64   0.000     .8612589    1.250187
parents_as~s |    1.01567   .0177606    57.19   0.000     .9808596     1.05048
        race |   .9607688    .031791    30.22   0.000     .8984591    1.023079
       _cons |   79.86437   .2833975   281.81   0.000     79.30892    80.41982
------------------------------------------------------------------------------
*/

* However, in actuality some of our data is missing.

foreach v in male age education legality parents_assets race social_network {

* There is a 1/16 chance of the value being missing

gen miss = rbinomial(1,`=1/16')

replace `v' = . if miss==1

drop miss

}

* One way of handling this would be to do our estimation after simply dropping the observations with missing data.

reg earnings male age education social_network legality parents_assets race

* We can see that although each individual explanatory variable is missing for only about 1/16 of the 200,000 observations, different people are missing different variables, so we get a substantial drop in the number of usable observations: few people are not missing the answer to at least one question.
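* The size of the drop is roughly what independence predicts: with 7 regressors each missing with probability 1/16, the expected share of complete cases is (15/16)^7, or about 64%, which lines up with the roughly 127,000 observations in the complete-case regression. A quick back-of-the-envelope check:

```stata
* Probability that a person answered all 7 questions, assuming
* independent 1/16 missingness in each, scaled up to 200,000 people
di "Expected complete cases: " round((15/16)^7 * 200000)
```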

* So we are going to try to impute our missing values to strengthen our estimation power.

* There is a user-written command for hot decking in Stata (http://ideas.repec.org/c/boc/bocode/s366901.html), but here we will code the procedure by hand.

* We will need to temporarily recode Stata's missing values to a numeric placeholder

sum

recode male age education legality parents_assets race social_network (.=-9999)

* This is a somewhat tricky bit of code that I believe is working correctly, but I may easily be mistaken.

foreach v in male age education legality parents_assets race social_network {

* I want to create a list of variables that excludes the current loop variable

local byvars = subinstr("male age education legality parents_assets race social_network", "`v'", "",.)

* Create an initial group variable

qui egen grp = group(`byvars')

qui sum grp

di "For variable `v' we have " r(max) " hot decks"

* Now let's generate a variable that indicates how many potential values to choose from in each hotdeck

qui gen missing = 1 if `v' == -9999

* Count the number of missings in each group

bysort grp: egen missing_count = sum(missing)

* Count the number of items in each group

bysort grp: gen all_count = _N

* The hot deck size is the number of observations in the deck less the number of missing observations

bysort grp: gen draw_deck = all_count - missing_count

* Finally we need to figure out where to start counting each group within the overall sort order

sort grp

qui gen n = _n

* This is the best trick I have right now for specifying where the group "starts" in the vertical distribution.

* I.e. the minimum of the _n values is the starting position on the vertical list for that group

bysort grp: egen pos_min = min(n)

* Pick which card to copy for each missing value by adding a random draw over the available cards to the starting position of the group

* (skip past the missing_count -9999 cards, which sort to the front of each group, and only draw when the deck has at least one valid card)

qui gen replace_card = floor(draw_deck*runiform()) + pos_min + missing_count if `v' == -9999 & draw_deck > 0

* Now we have to make sure our data is arranged so that within each group the -9999 cards come first

sort grp `v'

* Replace our `v' value with the card from the relevant hot deck

qui replace `v' = `v'[replace_card] if `v' == -9999 & replace_card < .

drop grp missing missing_count all_count draw_deck n pos_min replace_card

}

recode male age education legality parents_assets race social_network (-9999=.)

sum

reg earnings male age education social_network legality parents_assets race

/*
      Source |       SS       df       MS              Number of obs =  172180
-------------+------------------------------           F(  7,172172) =  918.86
       Model |  1033602.53      7  147657.505          Prob > F      =  0.0000
    Residual |    27667352 172172  160.696002          R-squared     =  0.0360
-------------+------------------------------           Adj R-squared =  0.0360
       Total |  28700954.6 172179  166.692538          Root MSE      =  12.677

------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .9096129   .0611052    14.89   0.000      .789848    1.029378
         age |   .0133704   .0050211     2.66   0.008     .0035292    .0232116
   education |   .3026236   .0091769    32.98   0.000     .2846371    .3206102
social_net~k |   .0415584   .0530193     0.78   0.433    -.0623583    .1454752
    legality |     .97673   .0899224    10.86   0.000     .8004842    1.152976
parents_as~s |      .9697   .0158518    61.17   0.000     .9386308    1.000769
        race |   .9613812    .027394    35.09   0.000     .9076896    1.015073
       _cons |   80.19244   .2510821   319.39   0.000     79.70032    80.68456
------------------------------------------------------------------------------
*/

* First thing to notice is that we have increased the number of usable observations to 172,180, about 35% more than the 127,112 in the complete-case regression.

* If the code is working properly, the reason I believe we have not regained our 200k observations is that many hot decks are only populated by missing values.

* Thus the hot decking algorithm has nothing to work with.
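* Assuming the imputation loop above has already run, a rough diagnostic of that claim (using `grp` and `donors` as throwaway variable names) is to pick one variable and count how many of its still-missing values sit in decks that contained no valid donor card at all:

```stata
* For one variable, count still-missing observations whose deck
* had no nonmissing value to draw from
egen grp = group(male age education legality parents_assets race), missing
bysort grp: egen donors = total(!missing(social_network))
count if missing(social_network) & donors == 0
drop grp donors
```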

* I have been told that hot decking, unlike classical measurement error, does not lead to attenuation bias, which is interesting but also problematic.

* Observe the t values and rejection rates in the second estimation compared with the first.

* Uniformly the t-values are getting larger despite the estimates not always getting better.

* This is because some of the regressor values are imputed and therefore cannot be trusted in the same fashion as standard exogenous regressors.
