df %>% split(.$x) becomes slow for a large number of unique values of x. If we instead split the data frame manually into smaller subsets and then split each subset, the whole operation runs much faster.
This isn't strictly a split.data.frame issue; there is a more general problem with the scalability of data.frame for many groups. You can get a pretty nice speed-up if you use split.data.table instead. I developed this method on top of regular data.table methods and it seems to scale pretty well here.
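The data frame behind the timings isn't shown here; a minimal setup in the same spirit (the sizes, the seed and the magrittr pipe are assumptions for illustration, not the original benchmark data) would be:

library(magrittr)   # provides %>%
set.seed(1)
# many groups (unique values of x), few rows per group
df <- data.frame(x = sample(1e5L, 1e7L, replace = TRUE),
                 y = runif(1e7))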
system.time(
l1 <- df %>% split(.$x)
)
#    user  system elapsed
# 200.936   0.000 217.496
library(data.table)
dt = as.data.table(df)
system.time(
l2 <- split(dt, by="x")
)
#    user  system elapsed
#   7.372   0.000   6.875
system.time(
l3 <- split(dt, by="x", sorted=TRUE)
)
#    user  system elapsed
#   9.068   0.000   8.200
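As a quick sanity check, assuming the objects from the timings above, all three results should have one element per unique value of x:

length(l1) == length(l2)   # TRUE
length(l2) == length(l3)   # TRUE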
sorted=TRUE will return the list in the same order as the data.frame method; by default, the data.table method will preserve the order present in the input data. If you want to stick to data.frame, you can at the end use lapply(l2, setDF).
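A short sketch of that last step, assuming the l1, l2 and l3 objects from the timings above; setDF converts each chunk back to a plain data.frame by reference:

l2 <- lapply(l2, setDF)            # each element is now a data.frame again
class(l2[[1L]])                    # "data.frame"
# with sorted=TRUE the group labels line up with the base-R result
head(names(l3)); head(names(l1))   # same labels in the same order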
PS. split.data.table was added in 1.9.7; installing the devel version is pretty simple:
install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")
More about that in the Installation wiki.