Simple way to do a weighted hot deck imputation in Stata?

Question

I'd like to do a simple weighted hot deck imputation in Stata. In SAS the equivalent command would be the following (and note that this is a newer SAS feature, beginning with SAS/STAT 14.1 in 2015 or so):

proc surveyimpute method=hotdeck(selection=weighted);

For clarity then, the basic requirements are:

Imputations must be row-based or simultaneous. If row 1 donates x to row 3, then it must also donate y.
Imputations must account for weights. A donor with weight=2 should be twice as likely to be selected as a donor with weight=1.

I'm assuming the missing data is rectangular. In other words, if the set of potentially missing variables consists of x and y then either both are missing or neither is missing. Here's some code to generate sample data.

global miss_vars "wealth income"
global weight    "weight"

set obs 6
gen id = _n
gen type = id > 3
gen income = 5000 * _n
gen wealth = income * 4 + 500 * uniform()
gen weight = 1
replace weight = 4 if mod(id-1,3) == 0

// set income & wealth missing every 3 rows
gen impute = mod(_n,3) == 0
foreach v in $miss_vars {
    replace `v' = . if impute == 1
}

Data looks like this:

            id       type     income     wealth     weight     impute
  1.         1          0       5000   20188.03          4          0
  2.         2          0      10000   40288.81          1          0
  3.         3          0          .          .          1          1
  4.         4          1      20000   80350.85          4          0
  5.         5          1      25000   100378.8          1          0
  6.         6          1          .          .          1          1

So in other words, for each row with missing values we need to randomly select (with weighting) a donor observation of the same type and use that donor to fill in both income and wealth. In practical use, generating the type variable is of course its own problem, but I'm keeping that very simple here to focus on the main issue.

For example, after the hot deck row 3 might look like either of the following, because it fills both income and wealth from row 1 or both from row 2 (but it would never take income from row 1 and wealth from row 2):

  3.         3          0       5000   20188.03          1          1
  3.         3          0      10000   40288.81          1          1

Also, since row 1 has weight=4 and row 2 has weight=1, row 1 should be the donor 80% of the time and row 2 should be the donor 20% of the time.

Answer 1:

Here are some brief notes about the community-contributed hotdeck routines by Adrian Mander and David Clayton mentioned in the comments above by @PearlySpencer (plus a follow-up version):

There seem to be a couple of versions:

hotdeck.ado (2007) https://ideas.repec.org/c/boc/bocode/s366901.html
whotdeck.ado (2011) https://econpapers.repec.org/software/bocbocode/s433201.htm

As best I can tell, both of these are designed to do an Approximate Bayesian Bootstrap, which is essentially a multiple-imputation version of a hot deck. Unfortunately, neither of them seems to handle sample (or survey) weights. The second of the two ("whotdeck") does have a parameter for weights, but this appears to be for predicting "missingness" and does not have anything to do with sample/survey weights.

The first one ("hotdeck") does at least seem to do a standard hotdeck, so may be used in that way if you don't need weights. The second one ("whotdeck") probably does a simple hotdeck also, but the syntax was a little trickier and I didn't succeed in getting it to do so (which is probably a failure by me and in any event is not to knock it as it seems designed for more complex situations).

I emailed Adrian Mander and he said he doesn't use Stack Overflow, but that it would be OK for me to post his email response to my question about using sample/survey weights with hotdeck or whotdeck:

Interesting problem, if the weights are frequency weights then the easiest thing to do is expand freq_weight and then use hotdeck.

It might be able to be done with a single line of code to make it work with other types of weight because currently the imputation is done by randomly ordering the rows of your dataset by generating a random number and then sorting.. with weights you would need to generate random numbers and then probably multiply the weights to the random numbers and then order them (I think this sort of thing would work but this idea has just popped into my head so would need some thinking about).
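The frequency-weight suggestion in the email could be sketched roughly as follows. This is only an illustration under stated assumptions: the weights are positive integers, the data are the sample data from the question (rebuilt here without income and wealth to keep it short), and is_copy is a hypothetical variable name. The hot deck call itself is left as a placeholder comment, since its exact syntax is not shown in this thread.

```stata
* Sketch only: assumes integer weights and the question's sample layout
clear
set obs 6
gen id = _n
gen type = id > 3
gen weight = 1
replace weight = 4 if mod(id-1,3) == 0
gen impute = mod(_n,3) == 0

* replicate each donor row "weight" times; recipient rows are left alone
* is_copy (hypothetical name) is 1 for the added duplicates, 0 for originals
expand weight if impute == 0, generate(is_copy)

* ... an unweighted hot deck within type would be run here ...

* afterwards, discard the replicated donor rows
drop if is_copy == 1
```

After the expand step a donor with weight=4 appears four times, so an unweighted draw within type picks it four times as often as a donor with weight=1. This does not extend to non-integer weights.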

Answer 2:

Here's a concise and simple approach that should also be quite fast even for large datasets as it only does 2 sorts and there is nothing else that should be computationally expensive. Here's the code with minimal comments, and further below is the same code but with more extensive comments:

gen sort_order = uniform()

// save recipient rows to file, keep donors
preserve
keep if impute == 1
save recipients, replace
restore
keep if impute == 0

// prep donor cells
sort type sort_order
by type:  gen weight_sum = sum($weight)
by type:  gen impute_weight = $weight / weight_sum[_N]
by type:  replace impute_weight = sum(impute_weight)
drop weight_sum

// bring back recipient rows and sort entire data set    
append using recipients
replace sort_order = impute_weight if impute_weight != .
gsort type -sort_order

// replace missing values via a simple replace
foreach v in $miss_vars {
   by type: replace `v' = `v'[_n-1] if impute == 1
}

// extra kludge step necessary to handle top rows
gsort type sort_order
foreach v in $miss_vars {
   by type: replace `v' = `v'[_n-1] if `v' == .
}

This seems to work fine for the test example but I haven't tested on larger and more complicated cases. As noted in the question, I expect this should give the same results as the SAS method:

proc surveyimpute method=hotdeck(selection=weighted);

Note also that if you don't want to use weights, you could just set them to be a column of ones (e.g. gen weight = 1).

And here is the same code, with more comments:

gen sort_order = uniform()

// split off and save the recipient rows
preserve
keep if impute == 1
save recipients, replace

// restore full dataset and keep only donor rows
restore
keep if impute == 0

// set up the donor rows.  the key idea is that each donor row
// represents a probability interval, where the ordering of the
// intervals within a cell is random (based on the variable
// "sort_order") and the width of each interval is proportional
// to the donor's weight
sort type sort_order
by type:  gen weight_sum = sum($weight)
by type:  gen impute_weight = $weight / weight_sum[_N]
by type:  replace impute_weight = sum(impute_weight)
drop weight_sum

// append the recipients so we again have a full dataset
// with both donors and recipients
append using recipients

// now we intersperse the donors and recipients using "sort_order"
// which is based on randomness and weight for the donors and
// is purely random for the recipients
replace sort_order = impute_weight if impute_weight != .
gsort type -sort_order

// fill recipient variables from donor rows.  conceptually
// this is very simple.  each recipient row falls within the
// range of some donor cell.  in practice, that is simply
// the nearest preceding donor cell
foreach v in $miss_vars {
   by type: replace `v' = `v'[_n-1] if impute == 1
}

// however, there's a minor practical issue that recipient
// cells that are in the range of the first donor cell need
// to be filled by the nearest successive donor cell, which
// can be done by reversing the sort and then filling from
// the nearest preceding donor cell
gsort type sort_order
foreach v in $miss_vars {
   by type: replace `v' = `v'[_n-1] if `v' == .
}
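As a rough check on the interval idea described in the comments above, here is a minimal sketch for the type 0 cell of the test data, where the donors have weights 4 and 1. It is not the code from this answer: the cumulative shares 0.8 and 1.0 are hard-coded, the variable names u and donor are made up for illustration, and it uses runiform() where the answer uses uniform().

```stata
* Sketch only: cumulative shares for weights 4 and 1 are hard-coded
clear
set seed 12345
set obs 10000

* one uniform draw per simulated recipient
gen double u = runiform()

* donor 1 covers the interval up to 0.8, donor 2 covers the rest up to 1
gen donor = cond(u <= 0.8, 1, 2)

* donor 1 should be chosen in roughly 80% of the draws
tabulate donor
```

The order in which the intervals are laid out does not change these probabilities, since each donor's interval has the same width (its share of the total weight) wherever it sits.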

Source: https://stackoverflow.com/questions/53324137/simple-way-to-do-a-weighted-hot-deck-imputation-in-stata

Tags

sas

stata

imputation