Wednesday, September 5, 2012

Hot Decking!


* Hot decking is a method commonly used in statistics to impute values where data is missing.

* Let's see how it works!

* Imagine we have a data set of 200,000 people.

* Some of the questions were not answered by every person, but most people answered the majority of the questions.

set seed 1010

clear
set obs 200000

gen male = rbinomial(1,.51)

gen age = rpoisson(40)

gen education = rpoisson(12)

gen legality = rbinomial(1,.15)

gen parents_assets = rpoisson(4)

gen social_network = rbinomial(4, .1)
* Number of close friends

gen race = ceil(runiform()*4)
* Let's say there are 4 "races" of approximately equal representation

* Note that if we are hot decking across all of these characteristics, then the number of potential hot decks equals the number of possible combinations of these variables: 2 for gender, 2 for legality, 5 for social networks (0 through 4), and 4 for race, so 2*2*5*4 = 80, before taking into account age, which has a mean of 40 and variance of 40, education with a mean and variance of 12, and parents_assets with a mean and variance of 4 (thus a standard deviation of 2).

* Since the Poisson distribution has no upper limit, age, education, and parents_assets could in principle present a problem, but in practice there is a low probability of draws more than a couple of standard deviations from the mean, since the Poisson begins to look very much like a discrete normal distribution as its mean gets large.
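
* As a quick sanity check of how many distinct cells actually show up in this simulated data, we could count the combinations directly (a rough sketch; deck_count is just a throwaway illustrative name and is not used again below):

egen deck_count = group(male age education legality parents_assets race social_network)
qui sum deck_count
di "Distinct combinations actually observed: " r(max)
drop deck_count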

* Now let's generate an outcome variable for us to predict

gen u = 2*rpoisson(40)

gen earnings = male + .01*age + .3*education + .1*social_network + legality + parents_assets + race + u

* The ideal case would be if we could run the OLS on fully observed data

reg earnings male age education social_network legality parents_assets race

/*


      Source |       SS       df       MS              Number of obs =  127112
-------------+------------------------------           F(  7,127104) =  772.07
       Model |  867810.062     7  123972.866           Prob > F      =  0.0000
    Residual |  20409480.7  127104   160.57308           R-squared     =  0.0408
-------------+------------------------------           Adj R-squared =  0.0407
       Total |  21277290.8  127111   167.39142           Root MSE      =  12.672

------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .8483197   .0711003    11.93   0.000     .7089644     .987675
         age |   .0118882   .0056213     2.11   0.034     .0008705     .022906
   education |    .315772   .0102739    30.74   0.000     .2956353    .3359086
social_net~k |   .0225761   .0592116     0.38   0.703    -.0934775    .1386298
    legality |   1.055723   .0992171    10.64   0.000     .8612589    1.250187
parents_as~s |    1.01567   .0177606    57.19   0.000     .9808596     1.05048
        race |   .9607688    .031791    30.22   0.000     .8984591    1.023079
       _cons |   79.86437   .2833975   281.81   0.000     79.30892    80.41982
------------------------------------------------------------------------------

*/

* However, in actuality some of our data is missing.

foreach v in male age education legality parents_assets race social_network {

* There is a 1/16 chance of the value being missing

gen miss = rbinomial(1,`=1/16')

replace `v' = . if miss==1

drop miss

}

* One way of handling this would be to do our estimation after simply dropping the observations with missing data.

reg earnings male age education social_network legality parents_assets race

* We can see that though each individual explanatory variable is only missing for about 1/16 of the 200,000 observations, because different people are missing different values across the explanatory variables we get a substantial drop in the number of usable observations: relatively few people answered every one of the questions.
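
* A quick back-of-the-envelope check of the expected complete-case sample size (assuming, as simulated above, independent 1/16 missingness on each of the seven regressors):

di "Expected complete-case share: " (15/16)^7
di "Expected complete cases out of 200,000: " round(200000*(15/16)^7)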

* So, we are going to try to impute the missing values to strengthen our estimation power.

* There is a user-written command for hot decking in Stata (http://ideas.repec.org/c/boc/bocode/s366901.html), but here we will code a simple hot deck ourselves.

* We will need to temporarily recode Stata's missing values to a placeholder value

sum
recode male age education legality parents_assets race social_network (.=-9999)

* This is a somewhat tricky bit of code that I believe is working correctly, but I may easily be mistaken.
foreach v in male age education legality parents_assets race social_network {

  * I want to create a list of by-variables that excludes the current looping variable
  local byvars = subinstr("male age education legality parents_assets race social_network", "`v'", "",.)

  * Create an initial group identifier for each hot deck
  qui egen grp = group(`byvars')

  qui sum grp

  di "For variable `v' we have " r(max) " hot decks"

  * Now let's work out how many donor values there are to choose from in each hot deck
  * First, flag the missing observations
  qui gen missing = 1 if `v' == -9999

  * Count the number of missings in each group
  bysort grp: egen missing_count = sum(missing)

  * Count the number of items in each group
  bysort grp: gen all_count = _N

  * The hot deck size (the number of donor cards) is the number of observations in the deck less the number of missing observations
  bysort grp: gen draw_deck = all_count - missing_count

  * Finally we need to figure out where each group starts within the overall sorted data
  sort grp

  qui gen n = _n
  * This is the best trick I have right now for marking where each group "starts" in the sorted data.

  * I.e. the minimum of the _n values within a group is the starting position of that group on the sorted list
  bysort grp: egen pos_min = min(n)

  * Specify which donor card to draw for each missing observation by adding a random draw over the number of available donors to the position where the donors start
  * (-9999 sorts before all valid values, so once we sort below the missing observations sit at the top of each deck and the donors begin missing_count positions in; decks with no donors are left missing)
  qui gen replace_card = floor(draw_deck*runiform()) + pos_min + missing_count if `v' == -9999 & draw_deck > 0

  * Now arrange the data so that within each group the missing (-9999) observations come first and the donor cards follow
  sort grp `v'

  * Replace our `v' value with the card from the relevant hot deck
  qui replace `v' = `v'[replace_card] if `v'==-9999

  drop grp missing missing_count all_count draw_deck n pos_min replace_card

}

recode male age education legality parents_assets race social_network (-9999=.)

sum

reg earnings male age education social_network legality parents_assets race

/*


      Source |       SS       df       MS              Number of obs =  172180
-------------+------------------------------           F(  7,172172) =  918.86
       Model |  1033602.53     7  147657.505           Prob > F      =  0.0000
    Residual |    27667352  172172  160.696002           R-squared     =  0.0360
-------------+------------------------------           Adj R-squared =  0.0360
       Total |  28700954.6  172179  166.692538           Root MSE      =  12.677

------------------------------------------------------------------------------
    earnings |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .9096129   .0611052    14.89   0.000      .789848    1.029378
         age |   .0133704   .0050211     2.66   0.008     .0035292    .0232116
   education |   .3026236   .0091769    32.98   0.000     .2846371    .3206102
social_net~k |   .0415584   .0530193     0.78   0.433    -.0623583    .1454752
    legality |     .97673   .0899224    10.86   0.000     .8004842    1.152976
parents_as~s |      .9697   .0158518    61.17   0.000     .9386308    1.000769
        race |   .9613812    .027394    35.09   0.000     .9076896    1.015073
       _cons |   80.19244   .2510821   319.39   0.000     79.70032    80.68456
------------------------------------------------------------------------------

*/

* The first thing to notice is that we have increased the number of usable observations to about 172K, roughly 35% more than the complete-case regression.

* If the code is working properly, the reason I believe we have not regained our 200k observations is that many hot decks are only populated by missing values.

* Thus the hot decking algorithm has nothing to work with.
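
* One rough way to check this explanation is to count how many values are still missing after the imputation pass (a quick sketch using the data as it stands at this point):

foreach v in male age education legality parents_assets race social_network {
  qui count if missing(`v')
  di "`v': " r(N) " values still missing after hot decking"
}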

* I have been told that hot decking is different from classical measurement error in that it does not lead to attenuation bias, which is interesting but also potentially problematic.

* Observe the t values and rejection rates in the second estimation compared with the first.

* The t-values are uniformly larger even though the point estimates are not always closer to the truth.

* This is because some of the regressors are imputed and therefore cannot be trusted in the same fashion as ordinarily observed regressors: the imputed values add observations without adding real information about earnings, so the reported standard errors likely overstate our precision.
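
* As a rough back-of-the-envelope check using the two output tables above, the reported standard errors shrink by roughly the square root of the sample-size ratio, which is about the mechanical gain one would expect from a larger N alone rather than from genuinely new information:

di "sqrt of sample-size ratio: " sqrt(172180/127112)
di "SE ratio for male:         " .0711003/.0611052
di "SE ratio for education:    " .0102739/.0091769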
