To parallelize a task, I need to split a big data.table into roughly equal parts,
keeping together the groups defined by a column, id. Suppose:
N
If the distribution of the ids is not pathologically skewed, the simplest approach would be something like this:
split(dt, as.numeric(as.factor(dt$id)) %% M)
It assigns each id to a bucket using the factor's integer code modulo the number of buckets, M.
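To see the modulo assignment in action, here is a minimal sketch on made-up data. The toy table, the `value` column, the seed, and `M <- 4` are all assumptions for illustration and are not from the question.

```r
library(data.table)

set.seed(42)                       # reproducible toy data (assumed)
M  <- 4                            # number of buckets (assumed)
dt <- data.table(id    = sample(letters, 1000, replace = TRUE),
                 value = rnorm(1000))

# Integer code of each id (1..number of distinct ids), folded into 0..M-1
bucket <- as.integer(as.factor(dt$id)) %% M
parts  <- split(dt, bucket)        # list of data.tables, one per bucket

# Rows per bucket, to eyeball the balance
print(sapply(parts, nrow))

# Sanity check: every id ends up in exactly one bucket
buckets_per_id <- tapply(bucket, dt$id, function(b) length(unique(b)))
stopifnot(all(buckets_per_id == 1))
```

Because the bucket depends only on the id's factor code, all rows of a group necessarily stay together; how even the bucket sizes are depends on the group sizes.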
For most applications this is good enough to get a relatively balanced distribution of data. You should be careful with input like time series, though. In such a case you can enforce a random order of levels when you create the factor. Choosing a prime number for M is a more robust approach, but most likely less practical.
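A rough sketch of the random-level-order idea, on a deliberately contrived input where group sizes repeat with a period equal to M (the worst case for a plain modulo). The ids, sizes, seed, and `M <- 4` are assumptions chosen for illustration.

```r
library(data.table)

set.seed(1)                                   # assumed seed, for reproducibility
M     <- 4                                    # number of buckets (assumed)
ids   <- sprintf("id%03d", 1:100)             # hypothetical ids, sorted order
sizes <- rep(c(10, 1, 1, 1), 25)              # every 4th group is 10x larger
dt    <- data.table(id = rep(ids, times = sizes))

# Plain approach: levels are sorted, so the big groups all share one bucket
naive_bucket <- as.integer(as.factor(dt$id)) %% M
print(table(naive_bucket))

# Shuffle the level order when creating the factor to break the periodicity
shuffled <- factor(dt$id, levels = sample(unique(dt$id)))
bucket   <- as.integer(shuffled) %% M
parts    <- split(dt, bucket)
print(sapply(parts, nrow))
```

Shuffling only changes which integer code each id receives, so groups are still kept whole; it just makes it unlikely that a regular pattern in the ids lines up with M.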