How to speed up subset by groups

被撕碎了的回忆 2020-11-29 06:23

I used to do my data wrangling with dplyr, but some of the computations are "slow", in particular subsetting by groups: I read that dplyr is slow when there are a lot of groups.
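For context, a minimal sketch of the kind of dplyr pipeline being described, assuming a data frame datas with grouping columns id1, id2 and a datetime column (names taken from the answer below; this is not the asker's actual code):

    library(dplyr)

    # Group-wise subset: within each (id1, id2) group, keep the rows
    # carrying the group's maximum datetime. With very many groups,
    # this filter() step is typically what gets slow.
    result <- datas %>%
      group_by(id1, id2) %>%
      filter(datetime == max(datetime)) %>%
      ungroup()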

2 answers
  •  遥遥无期
    2020-11-29 06:24

    How about summarizing the data.table and joining it back to the original data?
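
    (datas.dt is not shown in the answer; here is a minimal sketch of comparable test data, where the sizes and column types are arbitrary assumptions:)

    library(data.table)

    # Hypothetical test data: ~2M rows spread over many (id1, id2) groups
    set.seed(42)
    n <- 2e6
    datas.dt <- data.table(
      id1      = sample(1e4, n, replace = TRUE),
      id2      = sample(1e4, n, replace = TRUE),
      datetime = as.POSIXct("2020-01-01") + sample(1e6, n, replace = TRUE)
    )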

    system.time({
      # summarize: one row per (id1, id2) group with that group's max datetime
      datas1 <- datas.dt[, list(datetime = max(datetime)), by = c("id1", "id2")]
      # key both tables so the join below matches on id1, id2, datetime
      setkey(datas1, id1, id2, datetime)
      setkey(datas.dt, id1, id2, datetime)
      # join: keep only the rows of datas.dt that appear in the summary
      datas2 <- datas.dt[datas1]
    })
    #  user  system elapsed 
    # 0.083   0.000   0.084 
    

    which filters the data correctly, as verified against the direct group-wise subset:

    # baseline: the direct .I (row-index) group-wise subset, far slower here
    system.time(dat1 <- datas.dt[datas.dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1])
    #   user  system elapsed 
    # 23.226   0.000  23.256 
    all.equal(dat1, datas2)
    # [1] TRUE
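
    For reference, the .SD idiom expresses the same group-wise subset; a sketch (depending on the data.table version, it tends to land in the same slow range as the .I version above when there are very many groups, which is why the join wins):

    # group-wise subset via .SD: keeps each group's max-datetime rows
    dat2 <- datas.dt[, .SD[datetime == max(datetime)], by = .(id1, id2)]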
    

    Addendum

    The setkey calls are superfluous if you are using the devel version of data.table, which supports ad hoc joins via the on= argument (thanks to @akrun for the pointer):

    system.time({
      # summarize as before: one row per (id1, id2) with the max datetime
      datas1 <- datas.dt[, list(datetime = max(datetime)), by = c("id1", "id2")]
      # ad hoc join via on=, no setkey needed
      datas2 <- datas.dt[datas1, on = c('id1', 'id2', 'datetime')]
    })
    
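    The summarize-then-join can also be written as one nested expression; a sketch of the same idea in a single call:

    # summarize to the per-group max datetime, then join in one step
    datas2 <- datas.dt[
      datas.dt[, .(datetime = max(datetime)), by = .(id1, id2)],
      on = .(id1, id2, datetime)
    ]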
