Why is split inefficient on large data frames with many groups?

心在旅途 2021-01-12 07:09
df %>% split(.$x)

becomes slow for a large number of unique values of x. If we instead split the data frame manually into smaller subsets and then

3 answers
  •  無奈伤痛
    2021-01-12 07:33

    A neat trick that exploits group_split from dplyr 0.8.3 or above:

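    library(dplyr)  # provides tibble(), %>% and group_split()

    random_df <- tibble(colA = paste("A", 1:1200000, sep = "_"),
                        colB = paste("A", 1:1200000, sep = "_"),
                        colC = 1:1200000)

    # Base R: slow with this many groups
    random_df_list <- split(random_df, random_df$colC)

    # dplyr: one tibble per unique colC value, much faster
    random_df_list <- random_df %>% group_split(colC)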
    

    This reduces an operation of a few minutes to a few seconds!
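
To check the difference on your own machine, here is a minimal timing sketch. It assumes a much smaller, hypothetical data set (`small_df`, with `n` chosen only to keep the run short), so the absolute timings will not match the numbers above and will vary with hardware and package versions.

```r
library(dplyr)

# Hypothetical test data: one row per group, as in the example above,
# but far fewer rows so the base R version finishes quickly.
n <- 10000
small_df <- tibble(
  colA = paste("A", 1:n, sep = "_"),
  colC = 1:n
)

# Time each approach once with system.time().
t_split <- system.time(by_split <- split(small_df, small_df$colC))
t_group <- system.time(by_group <- small_df %>% group_split(colC))

# Elapsed wall-clock seconds for each approach.
print(c(split = t_split[["elapsed"]], group_split = t_group[["elapsed"]]))

# Both results contain one piece per unique value of colC.
print(c(split = length(by_split), group_split = length(by_group)))
```

One difference to keep in mind when swapping one for the other: split() names the list elements after the group values, while group_split() returns an unnamed list, so code that looks pieces up by name needs adjusting.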
