Why is split inefficient on large data frames with many groups?

心在旅途 2021-01-12 07:09
df %>% split(.$x)

becomes slow for a large number of unique values of x. If we instead split the data frame manually into smaller subsets and then run split on each subset, the operation is much faster.
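The two-level workaround the question alludes to can be sketched as follows; the coarse bucket key below is illustrative, not from the original post:

```r
set.seed(1)
# Toy data: many unique values of x
df <- data.frame(x = sprintf("g%04d", sample(1000, 1e4, replace = TRUE)),
                 y = rnorm(1e4), stringsAsFactors = FALSE)

# One-shot split: slow when length(unique(df$x)) is large
l_direct <- split(df, df$x)

# Two-level split: coarse buckets first, then split each bucket on x
bucket   <- substr(df$x, 1, 3)                      # illustrative coarse key
chunks   <- split(df, bucket)
l_nested <- unlist(lapply(chunks, function(d) split(d, d$x)),
                   recursive = FALSE)

# unlist() prefixes each name with its bucket, so strip that back off
names(l_nested) <- sub("^[^.]*\\.", "", names(l_nested))
```

Because each value of x falls into exactly one bucket, the nested result contains the same groups as the direct split, just built from many small split calls instead of one huge one.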

3 Answers
  •  無奈伤痛
    2021-01-12 07:35

    This isn't strictly a split.data.frame issue; there is a more general scalability problem with data.frame when there are many groups.
    You can get a pretty nice speed-up if you use split.data.table instead. I developed this method on top of regular data.table methods, and it seems to scale well here.

    system.time(
        l1 <- df %>% split(.$x)   
    )
    #   user  system elapsed 
    #200.936   0.000 217.496 
    library(data.table)
    dt = as.data.table(df)
    system.time(
        l2 <- split(dt, by="x")   
    )
    #   user  system elapsed 
    #  7.372   0.000   6.875 
    system.time(
        l3 <- split(dt, by="x", sorted=TRUE)   
    )
    #   user  system elapsed 
    #  9.068   0.000   8.200 
    

    sorted=TRUE returns the list in the same order as the data.frame method; by default, the data.table method preserves the order present in the input data. If you want to stick with data.frame, you can call lapply(l2, setDF) at the end.
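    A small sketch of that round-trip on toy data (not the benchmark above), assuming data.table is installed:

    ```r
    library(data.table)
    dt <- data.table(x = c("b", "a", "b"), y = 1:3)

    # sorted = TRUE orders the list by x, matching base split() on a data.frame
    parts <- split(dt, by = "x", sorted = TRUE)
    names(parts)    # "a" "b"

    # setDF() converts each piece back to a plain data.frame by reference
    invisible(lapply(parts, setDF))
    class(parts$a)  # "data.frame"
    ```

    Because setDF works by reference, no copies of the pieces are made during the conversion.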

    PS. split.data.table was added in 1.9.7; installing the devel version is pretty simple:

    install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")
    

    More about that in the Installation wiki.
