Data.table: how to get the blazingly fast subsets it promises and apply to a second data.table

Submitted by ぃ、小莉子 on 2019-12-01 11:16:52

This can be solved by updating in a non-equi join.

This avoids the memory issues caused by a cartesian join, as well as those caused by calling apply(), which coerces a data.frame or data.table to a matrix and thereby copies the data.

In addition, the OP has mentioned that lsr has a few hundred million rows and adherence has 1.5 million rows (500 time periods times 3,000 IDs). Therefore, storing the data items efficiently not only reduces the memory footprint but may also reduce the share of processing time spent loading the data.
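
As a rough illustration of the storage point, here is a minimal sketch with a hypothetical vector (not the OP's data): IDate is backed by integers, so a date vector stored as IDate needs roughly half the memory of the default double-backed Date class.

library(data.table)
# hypothetical vector of 1.5 million dates, for illustration only
d  <- rep(as.Date("2013-01-01"), 1.5e6)
id <- as.IDate(d)
object.size(d)   # Date is stored as double: roughly 12 MB here
object.size(id)  # IDate is stored as integer: roughly half of that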

library(data.table)
# coerce to data.table by reference, i.e., without copying
setDT(adherence)
setDT(lsr)
# coerce to IDate to save memory
adherence[, year := as.IDate(year)]
cols <- c("eksd", "ENDDATE")
lsr[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
# update in a non-equi join
adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE),
          AH := as.integer(ENDDATE - x.year)][]
   ID       year AH
1:  1 2013-01-01 NA
2:  2 2013-01-01 NA
3:  3 2013-01-01 NA
4:  1 2013-02-01 64
5:  2 2013-02-01 NA
6:  3 2013-02-01 63

Note that NA indicates that no match was found. If required, the AH column can be initialised before the non-equi join by adherence[, AH := 0L].
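
For completeness, a minimal sketch of that variant, re-using the objects created above:

# optional: pre-fill AH so unmatched rows show 0 instead of NA
adherence[, AH := 0L]
adherence[lsr, on = .(ID, year >= eksd, year < ENDDATE),
          AH := as.integer(ENDDATE - x.year)][]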

Data

The code to create the sample datasets can be streamlined:

adherence <- data.frame(
  ID = c("1", "2", "3", "1", "2", "3"), 
  year = as.Date(c("2013-01-01", "2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01", "2013-02-01")),
  stringsAsFactors = FALSE)

lsr <- data.frame(
  ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = as.Date(c("2012-03-01", "2012-08-02", "2013-01-06","2012-08-25", "2013-03-22", "2013-09-15", "2011-01-01", "2013-01-05")),
  DDD = as.integer(c("60", "90", "90", "60", "120", "60", "30", "90")),
  stringsAsFactors = FALSE)
lsr$ENDDATE <- lsr$eksd + lsr$DDD

Note that DDD is of type integer, which usually requires 4 bytes per value instead of the 8 bytes needed for type numeric/double.

Also note that the last statement may cause the whole data object lsr to be copied. This can be avoided by using data.table syntax which updates by reference.

library(data.table)
setDT(lsr)[, ENDDATE := eksd + DDD][]
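
As an optional check (not part of the original answer), data.table::address() can be used to verify that := updates lsr in place, i.e., that no new object is created:

data.table::address(lsr)      # memory address before the update
lsr[, ENDDATE := eksd + DDD]  # add/overwrite the column by reference
data.table::address(lsr)      # same address: lsr was not copied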

I am not sure why your function is slow (I think you could remove the ifelse() call), but I would suggest using a join to make it faster and to operate on one table only:

# adherence and lsr are data.tables here (see the setDT() calls above)
# a cartesian join of all rows works but can be memory-heavy for large data:
plouf <- lsr[adherence, on = "ID", allow.cartesian = TRUE]
plouf[, year := as.Date(year)]

# alternatively, loop over the distinct years and bind the per-year results:
bob <- rbindlist(lapply(unique(adherence$year), function(x){
  plouf <- lsr[adherence[year == x], on = "ID"]
  plouf[, year := as.Date(year)]
  plouf[year >= eksd & year < ENDDATE,
        list(sum = sum(as.numeric(ENDDATE - as.Date(year))), year = year), by = ID]
}))
bob

   ID sum       year
1:  1  64 2013-02-01
2:  3  63 2013-02-01

You can then merge the result back to adherence:

setDT(adherence)
adherence[, year := as.Date(year)]
bob[adherence, on = .(ID, year)]
   ID sum       year
1:  1  NA 2013-01-01
2:  2  NA 2013-01-01
3:  3  NA 2013-01-01
4:  1  64 2013-02-01
5:  2  NA 2013-02-01
6:  3  63 2013-02-01

For reading your data, use the fread() function, which is fast for big data.
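
As a sketch, with placeholder file names that are not taken from the question:

library(data.table)
# hypothetical paths; replace with your actual files
adherence <- fread("adherence.csv")
lsr       <- fread("lsr.csv")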
