To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together groups defined by a column, id. Suppose:
N
Preliminary comment
I recommend reading what the main author of data.table has to say about parallelization with it.
I don't know how familiar you are with data.table, but you may have overlooked its by argument...? Quoting @eddi's comment from below:
Instead of literally splitting up the data - create a new "parallel.id" column, and then call
dt[, parallel_operation(.SD), by = parallel.id]
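For concreteness, here is a minimal sketch of that approach. The example data, the chunk count M, and the body of parallel_operation are all made up for illustration; only the final call is from the quote above.

library(data.table)

# made-up example: 16 rows in 6 id groups, to be processed in M = 2 chunks
dt <- data.table(id = rep(letters[1:6], c(1, 1, 2, 3, 4, 5)), x = rnorm(16))
M  <- 2

# assign each id (not each row) to a chunk, so groups stay together
dt[, parallel.id := as.integer(factor(id)) %% M + 1]

# stand-in for the real per-chunk work
parallel_operation <- function(sd) sd[, .(rows = .N), by = id]

dt[, parallel_operation(.SD), by = parallel.id]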
Answer, assuming you don't want to use by
Sort the IDs by size:
ids <- names(sort(table(dt$id)))
n <- length(ids)
Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
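A toy run makes the effect visible (the letters below are hypothetical IDs, already sorted so the smallest group comes first):

ids <- c("e", "d", "c", "b", "a")   # smallest group first, largest last
n   <- length(ids)
c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
# [1] "e" "a" "d" "b" "c"   -- small and big IDs now alternate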
Split the ids in order into M groups (the desired number of chunks), with roughly the same number of IDs in each (like zero323's answer):
gs <- split(alt_ids, ceiling(seq(n) / (n/M)))
res <- vector("list", M)
setkey(dt, id)
for (m in 1:M) res[[m]] <- dt[J(gs[[m]])]
# if using a data.frame, replace the last two lines with
# for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]],]
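With the chunks in res, the actual parallel step can be whatever you prefer. A sketch using parallel::mclapply (forking, so it only runs in parallel on Unix-alikes; work_on_chunk is a made-up stand-in for the real task):

library(data.table)
library(parallel)

# stand-in for the real per-chunk computation
work_on_chunk <- function(chunk) chunk[, .(rows = .N), by = id]

out <- mclapply(res, work_on_chunk, mc.cores = M)
rbindlist(out)  # stitch the per-chunk results back together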
Check that the sizes aren't too bad:
# using the OP's example data...
sapply(res, nrow)
# [1] 7 9 for M = 2
# [1] 5 5 6 for M = 3
# [1] 1 6 3 6 for M = 4
# [1] 1 4 2 3 6 for M = 5
Although I emphasized data.table at the top, this should work fine with a data.frame, too.