Why is split inefficient on large data frames with many groups?

心在旅途 2021-01-12 07:09
df %>% split(.$x)

becomes slow for a large number of unique values of x. If we instead split the data frame manually into smaller subsets and then run split on each subset, the operation is much faster.
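The two-level workaround the question alludes to can be sketched as follows; the coarse bucket key below is illustrative, not from the original post:

```r
set.seed(1)
# Toy data: many unique values of x
df <- data.frame(x = sprintf("g%04d", sample(1000, 1e4, replace = TRUE)),
                 y = rnorm(1e4), stringsAsFactors = FALSE)

# One-shot split: slow when length(unique(df$x)) is large
l_direct <- split(df, df$x)

# Two-level split: coarse buckets first, then split each bucket on x
bucket   <- substr(df$x, 1, 3)                      # illustrative coarse key
chunks   <- split(df, bucket)
l_nested <- unlist(lapply(chunks, function(d) split(d, d$x)),
                   recursive = FALSE)

# unlist() prefixes each name with its bucket, so strip that back off
names(l_nested) <- sub("^[^.]*\\.", "", names(l_nested))
```

Because each value of x falls into exactly one bucket, the nested result contains the same groups as the direct split, just built from many small split calls instead of one huge one.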

3 Answers
  •  無奈伤痛
    2021-01-12 07:35

    This isn't strictly a split.data.frame issue; there is a more general scalability problem with data.frame when there are many groups.
    You can get a pretty nice speed-up if you use split.data.table instead. I developed this method on top of regular data.table methods, and it seems to scale well here.

    system.time(
        l1 <- df %>% split(.$x)   
    )
    #   user  system elapsed 
    #200.936   0.000 217.496 
    library(data.table)
    dt = as.data.table(df)
    system.time(
        l2 <- split(dt, by="x")   
    )
    #   user  system elapsed 
    #  7.372   0.000   6.875 
    system.time(
        l3 <- split(dt, by="x", sorted=TRUE)   
    )
    #   user  system elapsed 
    #  9.068   0.000   8.200 
    

    sorted=TRUE returns the list in the same order as the data.frame method; by default, the data.table method preserves the order present in the input data. If you want to stick with data.frame, you can call lapply(l2, setDF) at the end.
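    A small sketch of that round-trip on toy data (not the benchmark above), assuming data.table is installed:

    ```r
    library(data.table)
    dt <- data.table(x = c("b", "a", "b"), y = 1:3)

    # sorted = TRUE orders the list by x, matching base split() on a data.frame
    parts <- split(dt, by = "x", sorted = TRUE)
    names(parts)    # "a" "b"

    # setDF() converts each piece back to a plain data.frame by reference
    invisible(lapply(parts, setDF))
    class(parts$a)  # "data.frame"
    ```

    Because setDF works by reference, no copies of the pieces are made during the conversion.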

    PS. split.data.table was added in 1.9.7; installing the devel version is pretty simple:

    install.packages("data.table", type="source", repos="http://Rdatatable.github.io/data.table")
    

    More about that in the Installation wiki.
