I have a large dataset that chokes split() in R. I am able to use dplyr group_by (which is a preferred way anyway) but I am unable to persist the r
Comparing the base, plyr and dplyr solutions, it still seems the base one is much faster!
library(plyr)
library(dplyr)
df <- data_frame(Group1=rep(LETTERS, each=1000),
Group2=rep(rep(1:10, each=100),26),
Value=rnorm(26*1000))
microbenchmark(Base=df %>%
split(list(.$Group2, .$Group1)),
dplyr=df %>%
group_by(Group1, Group2) %>%
do(data = (.)) %>%
select(data) %>%
lapply(function(x) {(x)}) %>% .[[1]],
plyr=dlply(df, c("Group1", "Group2"), as.tbl),
times=50)
Gives:
Unit: milliseconds
expr min lq mean median uq max neval
Base 12.82725 13.38087 16.21106 14.58810 17.14028 41.67266 50
dplyr 25.59038 26.66425 29.40503 27.37226 28.85828 77.16062 50
plyr 99.52911 102.76313 110.18234 106.82786 112.69298 140.97568 50