How to speed up subset by groups

被撕碎了的回忆 2020-11-29 06:23

I used to do my data wrangling with dplyr, but some of the computations are "slow", in particular subsetting by groups: I read that dplyr is slow when there are a lot of groups.
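For context, a minimal sketch of the kind of dplyr pipeline being described, assuming a data frame datas with grouping columns id1, id2 and a datetime column (names taken from the answer below; this is not the asker's actual code):

    library(dplyr)

    # Group-wise subset: within each (id1, id2) group, keep the rows
    # carrying the group's maximum datetime. With very many groups,
    # this filter() step is typically what gets slow.
    result <- datas %>%
      group_by(id1, id2) %>%
      filter(datetime == max(datetime)) %>%
      ungroup()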

2 answers
  •  遥遥无期
    2020-11-29 06:24

    How about summarizing the data.table and joining it back to the original data?
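
    (datas.dt is not shown in the answer; here is a minimal sketch of comparable test data, where the sizes and column types are arbitrary assumptions:)

    library(data.table)

    # Hypothetical test data: ~2M rows spread over many (id1, id2) groups
    set.seed(42)
    n <- 2e6
    datas.dt <- data.table(
      id1      = sample(1e4, n, replace = TRUE),
      id2      = sample(1e4, n, replace = TRUE),
      datetime = as.POSIXct("2020-01-01") + sample(1e6, n, replace = TRUE)
    )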

    system.time({
      # summarize: one row per (id1, id2) group with that group's max datetime
      datas1 <- datas.dt[, list(datetime = max(datetime)), by = c("id1", "id2")]
      # key both tables so the join below matches on id1, id2, datetime
      setkey(datas1, id1, id2, datetime)
      setkey(datas.dt, id1, id2, datetime)
      # join: keep only the rows of datas.dt that appear in the summary
      datas2 <- datas.dt[datas1]
    })
    #  user  system elapsed 
    # 0.083   0.000   0.084 
    

    which filters the data correctly, as verified against the direct group-wise subset:

    # baseline: the direct .I (row-index) group-wise subset, far slower here
    system.time(dat1 <- datas.dt[datas.dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1])
    #   user  system elapsed 
    # 23.226   0.000  23.256 
    all.equal(dat1, datas2)
    # [1] TRUE
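
    For reference, the .SD idiom expresses the same group-wise subset; a sketch (depending on the data.table version, it tends to land in the same slow range as the .I version above when there are very many groups, which is why the join wins):

    # group-wise subset via .SD: keeps each group's max-datetime rows
    dat2 <- datas.dt[, .SD[datetime == max(datetime)], by = .(id1, id2)]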
    

    Addendum

    The setkey calls are superfluous if you are using the devel version of data.table, which supports ad hoc joins via the on= argument (thanks to @akrun for the pointer):

    system.time({
      # summarize as before: one row per (id1, id2) with the max datetime
      datas1 <- datas.dt[, list(datetime = max(datetime)), by = c("id1", "id2")]
      # ad hoc join via on=, no setkey needed
      datas2 <- datas.dt[datas1, on = c('id1', 'id2', 'datetime')]
    })
    
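    The summarize-then-join can also be written as one nested expression; a sketch of the same idea in a single call:

    # summarize to the per-group max datetime, then join in one step
    datas2 <- datas.dt[
      datas.dt[, .(datetime = max(datetime)), by = .(id1, id2)],
      on = .(id1, id2, datetime)
    ]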
