To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together groups defined by a column, id. Suppose:
N
Preliminary comment
I recommend reading what the main author of data.table has to say about parallelization with it.
I don't know how familiar you are with data.table, but you may have overlooked its by argument...? Quoting @eddi's comment from below:
Instead of literally splitting up the data - create a new "parallel.id" column, and then call
dt[, parallel_operation(.SD), by = parallel.id]
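For concreteness, here is a minimal sketch of that approach. The example data, the chunk count M, and the body of parallel_operation are all made up for illustration; only the final call is from the quote above.

library(data.table)

# made-up example: 16 rows in 6 id groups, to be processed in M = 2 chunks
dt <- data.table(id = rep(letters[1:6], c(1, 1, 2, 3, 4, 5)), x = rnorm(16))
M  <- 2

# assign each id (not each row) to a chunk, so groups stay together
dt[, parallel.id := as.integer(factor(id)) %% M + 1]

# stand-in for the real per-chunk work
parallel_operation <- function(sd) sd[, .(rows = .N), by = id]

dt[, parallel_operation(.SD), by = parallel.id]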
Answer, assuming you don't want to use by
Sort the IDs by size:
ids <- names(sort(table(dt$id)))
n <- length(ids)
Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:
alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
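A toy run makes the effect visible (the letters below are hypothetical IDs, already sorted so the smallest group comes first):

ids <- c("e", "d", "c", "b", "a")   # smallest group first, largest last
n   <- length(ids)
c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
# [1] "e" "a" "d" "b" "c"   -- small and big IDs now alternate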
Split the ids in order into M groups (the desired number of chunks), with roughly the same number of IDs in each (like zero323's answer):
gs <- split(alt_ids, ceiling(seq(n) / (n/M)))
res <- vector("list", M)
setkey(dt, id)
for (m in 1:M) res[[m]] <- dt[J(gs[[m]])]
# if using a data.frame, replace the last two lines with
# for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]],]
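With the chunks in res, the actual parallel step can be whatever you prefer. A sketch using parallel::mclapply (forking, so it only runs in parallel on Unix-alikes; work_on_chunk is a made-up stand-in for the real task):

library(data.table)
library(parallel)

# stand-in for the real per-chunk computation
work_on_chunk <- function(chunk) chunk[, .(rows = .N), by = id]

out <- mclapply(res, work_on_chunk, mc.cores = M)
rbindlist(out)  # stitch the per-chunk results back together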
Check that the sizes aren't too bad:
# using the OP's example data...
sapply(res, nrow)
# [1] 7 9 for M = 2
# [1] 5 5 6 for M = 3
# [1] 1 6 3 6 for M = 4
# [1] 1 4 2 3 6 for M = 5
Although I emphasized data.table at the top, this should work fine with a data.frame, too.