bootstrap resampling for hierarchical/multilevel data

Question

I am attempting to do bootstrap resampling on a multilevel/hierarchical dataset. The observations are (unique) patients clustered within hospitals.

My strategy is to sample with replacement from the patients within each hospital in turn, which ensures that all hospitals are represented in the sample and that, when the procedure is repeated, all the sample sizes are the same. This is method 2 here.

My code is like this:

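# unique hospital identifiers, ignoring missing values
hv <- na.omit(unique(dt$hospital))

samp.out <- NULL

for (hosp in hv) {
    # all patients in this hospital
    ss1 <- dt[dt$hospital == hosp & !is.na(dt$hospital), ]
    # resample the same number of patients, with replacement
    ss2 <- ss1[sample(seq_len(nrow(ss1)), nrow(ss1), replace = TRUE), ]
    samp.out <- rbind(samp.out, ss2)
}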

This seems to work (though if anyone can see any problem I would be grateful).

The issue is that it is slow, so I would like to know if there are ways to speed this up.

Update:

I have tried to implement Ari B. Friedman's answer but without success - so I have modified it slightly, with the aim of constructing a vector which then indexes the original dataframe. Here is my new code:

# this is a vector that will hold unique IDs
v.samp <- rep(NA, nrow(dt))

#entry to fill next
i <- 1

for (hosp in hv ) {
    ss1 <- dt[dt$hospital==hosp & !is.na(dt$hospital),]

    # column 1 contains a unique ID
    ss2 <- ss1[sample(1:nrow(ss1),nrow(ss1), replace=T),1]
    N.fill <- length(ss2)
    v.samp[ seq(i,i+N.fill-1) ] <- ss2

    # update entry to fill next
    i <- i + N.fill
}

samp.out <- dt[dt$unid %in% v.samp,]

This is fast! But it does not work properly: the final line uses %in%, which selects each ID in v.samp only once, whereas the sampling is with replacement, so v.samp contains repeated IDs. Any further help would be much appreciated.

Answer 1:

The usual trick for speeding up bootstrapping is to draw the whole sample (all replicates) for each hospital at once, then assign the draws to replicates. That way you only run the ss1 <- subsetting line once per hospital. You can likely improve on that by not subsetting for each hospital at all. Another big win might come from pre-allocating the result rather than calling rbind inside the loop. More suggestions on speed improvements.
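A minimal sketch of the "draw every replicate at once" idea, not a definitive implementation. The toy dt, the replicate count R, and the names rows.by.hosp, draws and boot.idx are assumptions made up for illustration; only the hospital and unid columns come from the question.

```r
# Toy data standing in for dt (hypothetical values).
set.seed(1)
dt <- data.frame(unid = 1:10,
                 hospital = rep(c("A", "B", "C"), times = c(5, 3, 2)))
R <- 4  # number of bootstrap replicates (assumed)

# Row numbers of dt grouped by hospital; split() drops rows whose hospital is NA.
rows.by.hosp <- split(seq_len(nrow(dt)), dt$hospital)

# For each hospital, draw all R replicates in a single call.
# Each result is a matrix with one column per replicate.
draws <- lapply(rows.by.hosp, function(idx) {
    n <- length(idx)
    # Index through sample.int() so a hospital with one patient is handled correctly.
    matrix(idx[sample.int(n, n * R, replace = TRUE)], nrow = n, ncol = R)
})

# Stack the hospitals: column r holds the row numbers for replicate r.
boot.idx <- do.call(rbind, draws)

# Replicate 1 as a data.frame; repeated row numbers give repeated rows.
samp.out <- dt[boot.idx[, 1], ]
```

Because only row numbers are stored, dt is subset once per replicate, at the point where that replicate is actually needed.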

To pre-allocate, calculate how many entries you need (call it N.out). Then, just before your loop, add:

samp.out <- rep(NA, N.out)

And replace your rbind line with:

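samp.out[ seq(i, i+N.iter-1) ] <- ss2

Where i is the first entry not yet filled, N.iter is the number of entries drawn on this round, and i+N.iter-1 is therefore the last entry filled on this round. Since samp.out is a vector here, ss2 must be a vector too (row numbers or IDs), not data.frame rows.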
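The pre-allocation steps above can be sketched as follows. This is an illustration under assumptions: the toy dt is made up, the vector stores row numbers of dt rather than rows, and N.fill (the name used in the question's update) is the number of patients drawn for the current hospital.

```r
# Toy data standing in for dt (hypothetical values).
set.seed(2)
dt <- data.frame(unid = 101:110,
                 hospital = rep(c("A", "B", "C"), times = c(5, 3, 2)))
hv <- na.omit(unique(dt$hospital))

# One slot per row that has a hospital, allocated once before the loop.
N.out <- sum(!is.na(dt$hospital))
v.samp <- rep(NA_integer_, N.out)

i <- 1  # first entry not yet filled
for (hosp in hv) {
    idx <- which(dt$hospital == hosp)  # which() ignores NA comparisons
    N.fill <- length(idx)
    # N.fill draws occupy positions i .. i + N.fill - 1
    v.samp[seq(i, i + N.fill - 1)] <- idx[sample.int(N.fill, N.fill, replace = TRUE)]
    i <- i + N.fill
}
```

After the loop, i should equal N.out + 1 and v.samp should contain no NA, which makes a cheap sanity check against off-by-one errors in the fill range.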

See the R Inferno for more details and tricks.

Update

You have two approaches and you're mixing them. You can either make v.samp a data.frame and sample the rows into it as you go, or you can sample IDs and then select from the data.frame with the vector of IDs outside of the loop. The key to the latter is that myDF[c(1,1,5,2,3),] gives you a data.frame in which the first row is repeated, which is exactly what you want and exactly what that feature was designed for. Make sure v.samp holds something you can index a data.frame by (either row numbers or row names), then do the selection outside the loop.
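A small sketch of the difference, assuming unid is unique per patient as the question states. The toy values are made up, and using match() to convert sampled IDs into row numbers is one possible way to do it, not necessarily what the answer had in mind.

```r
# Toy data standing in for dt (hypothetical values).
dt <- data.frame(unid = c(11, 12, 13, 14, 15),
                 hospital = c("A", "A", "A", "B", "B"))
v.samp <- c(11, 11, 15, 12, 13)  # sampled IDs, with 11 drawn twice

# %in% only tests membership, so each matching row comes back once.
dedup <- dt[dt$unid %in% v.samp, ]

# match() turns each sampled ID into a row number, keeping repeats and order;
# indexing by those row numbers repeats rows, like myDF[c(1,1,5,2,3),].
samp.out <- dt[match(v.samp, dt$unid), ]
```

Here dedup has four rows while samp.out has five, one per sampled ID. If v.samp already holds row numbers, dt[v.samp, ] is enough and match() is not needed.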

Source: https://stackoverflow.com/questions/12983038/bootstrap-resampling-for-hierarchical-multilevel-data

Tags

resampling