Split data.table into roughly equal parts

抹茶落季 2021-01-20 13:59

To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, id. Suppose:

N

4 Answers
  •  情书的邮戳
    2021-01-20 14:25

    If the distribution of the ids is not pathologically skewed, the simplest approach is something like this:

    split(dt, as.numeric(as.factor(dt$id)) %% M)
    

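    It assigns each id to a bucket by taking its integer factor code modulo the number of buckets, M. Every row with the same id gets the same code, so each group stays together in one bucket.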

    For most applications this is good enough to get a relatively balanced distribution of data. Be careful with inputs like time series, though, where the ordering of the ids is not arbitrary. In that case you can simply enforce a random order of the levels when you create the factor. Choosing a prime number for M is more robust, but most likely less practical.
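
The randomized-levels variant described above can be sketched roughly as follows. This is only an illustration under assumptions: the toy table `dt`, its `value` column, the 50 ids, and `M <- 4` are made up for the example and are not from the question.

```r
library(data.table)

set.seed(42)  # only to make the illustration reproducible

# Hypothetical toy data: 1000 rows whose ids are drawn from 50 groups.
dt <- data.table(
  id    = sample(sprintf("id%02d", 1:50), 1000, replace = TRUE),
  value = rnorm(1000)
)
M <- 4  # number of buckets (assumed)

# Shuffle the factor levels so that an id's integer code is unrelated
# to its sort order, then take the code modulo M as the bucket.
lvls   <- sample(unique(dt$id))
bucket <- as.integer(factor(dt$id, levels = lvls)) %% M

# One data.table per bucket, named "0" to "3".
parts <- split(dt, bucket)

# Rows per bucket, to see how balanced the split is.
sapply(parts, nrow)

# Number of distinct buckets each id falls into; should be 1 for all ids.
buckets_per_id <- tapply(bucket, dt$id, function(b) length(unique(b)))
all(buckets_per_id == 1)
```

The bucket sizes depend on how many rows each id has, so this balances the number of ids per bucket rather than the number of rows. With heavily skewed ids the parts can still differ a lot in size.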
