Split data.table into roughly equal parts

抹茶落季 2021-01-20 13:59

To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, id. Suppose:

N

4 answers

情书的邮戳 (original poster)

2021-01-20 14:25
If the distribution of the ids is not pathologically skewed, the simplest approach is something like this:
```
# M is the desired number of parts; rows sharing an id always land in the same part
split(dt, as.numeric(as.factor(dt$id)) %% M)
```
It assigns each id to a bucket by taking the id's integer factor code modulo the number of buckets, M.
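As a rough illustration of that bucket assignment, here is a minimal sketch on a toy table. The table contents, the `value` column, and the choice `M <- 4L` are assumptions made up for the example, not part of the question.

```r
library(data.table)

M <- 4L  # number of parts (assumed value for illustration)

# Toy table: 20 ids, where the k-th id has k rows (210 rows in total)
ids <- sprintf("id%02d", 1:20)
dt <- data.table(id = rep(ids, times = 1:20))
dt[, value := seq_len(.N)]

# Integer factor code of each id, reduced modulo M to a bucket in 0..M-1
bucket <- as.integer(as.factor(dt$id)) %% M

# One data.table per bucket; all rows of an id stay together
parts <- split(dt, bucket)

# Row counts per part, and the ids that ended up in each part
part_sizes <- sapply(parts, nrow)
ids_per_part <- lapply(parts, function(p) unique(p$id))
print(part_sizes)
```

The parts are only as balanced as the group sizes allow, since a whole id is never split across parts.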

For most applications this is good enough to get a reasonably balanced distribution of data. Be careful with input like time series, though: if the sorted order of the ids carries a pattern, the modulo can map that pattern onto the buckets. In such a case you can enforce a random order of levels when you create the factor. Choosing a prime number for M is a more robust approach but most likely less practical.
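The random ordering of levels could be sketched as follows. The data here is a contrived assumption: every fourth id is a large group, so with sorted levels and `M <- 4L` all the large groups collide in one bucket. Shuffling the levels removes the link between sorted order and bucket, though it does not guarantee a balanced result for any particular seed.

```r
library(data.table)

set.seed(1)  # only so the shuffle is reproducible
M <- 4L      # number of parts (assumed value for illustration)

# Contrived input: in sorted order, every 4th id has 100 rows, the rest 1 row
ids <- sprintf("id%02d", 1:20)
sizes <- rep(c(100L, 1L, 1L, 1L), times = 5)
dt <- data.table(id = rep(ids, times = sizes))

# Sorted levels: the period of the pattern matches M, so the split is skewed
sorted_bucket <- as.integer(factor(dt$id)) %% M

# Shuffled levels: the factor codes no longer follow the sorted order of ids
shuffled_levels <- sample(unique(dt$id))
shuffled_bucket <- as.integer(factor(dt$id, levels = shuffled_levels)) %% M

print(table(sorted_bucket))
print(table(shuffled_bucket))

parts <- split(dt, shuffled_bucket)
```

Comparing the two tables shows how much the shuffle helps for a given input; groups are still kept whole in either case.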