I used to achieve my data wrangling with dplyr, but some of the computations are "slow". In particular subsetting by groups: I read that dplyr is slow when there are a lot of groups.
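(For reference, the kind of grouped subset I mean looks roughly like this in dplyr; a sketch, assuming the data also exists as a plain data frame called datas with the same id1/id2/datetime columns used below.)

library(dplyr)

# keep, for each (id1, id2) group, the row(s) with the latest datetime
datas %>%
  group_by(id1, id2) %>%
  filter(datetime == max(datetime))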
How about summarizing the data.table first and then joining back?
Using the original data:
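(The data itself isn't reproduced in this post, and the timings below come from that original data. For a self-contained run, a comparable datas.dt could be sketched roughly as follows; the column names id1, id2 and datetime are taken from the code below, while the sizes are pure assumptions.)

library(data.table)

# purely illustrative stand-in for the original data (not the real dataset)
set.seed(1)
N <- 1e6
datas.dt <- data.table(
  id1      = sample(1e4, N, replace = TRUE),
  id2      = sample(100,  N, replace = TRUE),
  datetime = as.POSIXct("2015-01-01", tz = "UTC") + sample(1e6, N, replace = TRUE)
)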
system.time({
datas1 <- datas.dt[, list(datetime = max(datetime)), by = c("id1", "id2")]  # summarize: max datetime per (id1, id2)
setkey(datas1, id1, id2, datetime)
setkey(datas.dt, id1, id2, datetime)
datas2 <- datas.dt[datas1]  # keyed join back to the full data
})
# user system elapsed
# 0.083 0.000 0.084
Compared with the grouped-index subset, which also correctly filters the data but is much slower:
# .I gives the row indices of the per-group maxima; subset the full table by them
system.time(
  dat1 <- datas.dt[datas.dt[, .I[datetime == max(datetime)], by = c("id1", "id2")]$V1]
)
# user system elapsed
# 23.226 0.000 23.256
all.equal(dat1, datas2)
# [1] TRUE
Addendum
The setkey step is superfluous if you are using the devel version of data.table and join with the on= argument (thanks to @akrun for the pointer):
system.time({
datas1 <- datas.dt[, list(datetime = max(datetime)), by = c("id1", "id2")]  # summarize: max datetime per (id1, id2)
datas2 <- datas.dt[datas1, on = c("id1", "id2", "datetime")]  # join on the columns directly, no keys needed
})
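The same join can also be written in a single step, skipping the intermediate datas1; just a stylistic variant, the result should be identical to datas2 above.

datas2 <- datas.dt[datas.dt[, list(datetime = max(datetime)), by = c("id1", "id2")],
                   on = c("id1", "id2", "datetime")]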