relative windowed running sum through data.table non-equi join

Posted by 匆匆过客 on 2019-11-30 20:30:46
ExperimenteR

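First, for each customer, we count how many transactions fall in the 45-day window ending on the current transaction date (the current date included). This assumes the rows are sorted by customerID and then by transactionDate, because findInterval needs a sorted vector.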

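library(data.table)
setDT(df)  # convert df to a data.table by reference
# n = row index within the customer minus the count of that customer's
# transactions dated on or before (current date - 45)
df[, n := seq_len(.N) - findInterval(transactionDate - 45, transactionDate), by = .(customerID)]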
df
#   productId customerID transactionDate purchaseQty n
#1:    870826    1186951      2016-03-28      162000 1
#2:    870826    1244216      2016-03-31        5000 1
#3:    870826    1244216      2016-04-08        6500 2
#4:    870826    1308671      2016-03-28      221367 1
#5:    870826    1308671      2016-03-29       83633 2
#6:    870826    1308671      2016-11-29       60500 1
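To see where n comes from, here is a minimal sketch in base R for a single customer, using the three dates of customer 1308671 from the data. The variable names d, before and n are just for illustration. findInterval(x, vec) returns how many elements of the sorted vector vec are less than or equal to x, so subtracting it from the row index leaves the number of transactions inside the window.

```r
# Dates of one customer, sorted ascending (taken from the sample data)
d <- as.Date(c("2016-03-28", "2016-03-29", "2016-11-29"))

# For each date, count the transactions dated on or before (date - 45),
# i.e. the ones that have already dropped out of the window
before <- findInterval(as.numeric(d) - 45, as.numeric(d))

# Row index minus the dropped-out count = transactions inside the window
n <- seq_along(d) - before
n
```

Note that a transaction dated exactly 45 days before the current one is counted as outside the window here, because findInterval counts values that are less than or equal to the lower edge.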

Next, we compute a rolling sum of purchaseQty in which each row uses its own window size n, adapting the approach from another answer:

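g <- function(x, window){
  b_pos <- seq_along(x) - window + 1  # first position of each row's window
  cum <- cumsum(x)                    # running total over the whole column
  cum - cum[b_pos] + x[b_pos]         # sum of x from b_pos up to the current row
}
df[, sumWindowPurchases := g(purchaseQty, n)][, n := NULL]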
df
#   productId customerID transactionDate purchaseQty sumWindowPurchases
#1:    870826    1186951      2016-03-28      162000             162000
#2:    870826    1244216      2016-03-31        5000               5000
#3:    870826    1244216      2016-04-08        6500              11500
#4:    870826    1308671      2016-03-28      221367             221367
#5:    870826    1308671      2016-03-29       83633             305000
#6:    870826    1308671      2016-11-29       60500              60500
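The cumulative-sum trick in g can be checked on a small made-up vector. The sketch below repeats g, applies it to hypothetical values x with hypothetical window sizes w, and compares the result with a brute-force sum over each window. It also suggests why the call above works without by: the running total is taken over the whole column, but each window only reaches back over rows of the same customer, provided the table is sorted by customerID and transactionDate.

```r
g <- function(x, window) {
  b_pos <- seq_along(x) - window + 1  # first position of each row's window
  cum <- cumsum(x)                    # running total over the whole vector
  cum - cum[b_pos] + x[b_pos]         # total up to row i minus total before b_pos
}

x <- c(10, 20, 30, 40)  # made-up quantities
w <- c(1, 2, 2, 3)      # made-up window sizes, one per row

fast <- g(x, w)

# Brute force: explicitly sum the last w[i] elements ending at position i
brute <- vapply(seq_along(x),
                function(i) sum(x[(i - w[i] + 1):i]),
                numeric(1))

fast
brute
```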

Data

df <- structure(list(productId = c(870826L, 870826L, 870826L, 870826L, 
870826L, 870826L), customerID = c(1186951L, 1244216L, 1244216L, 
1308671L, 1308671L, 1308671L), transactionDate = structure(c(16888, 
16891, 16899, 16888, 16889, 17134), class = "Date"), purchaseQty = c(162000L, 
5000L, 6500L, 221367L, 83633L, 60500L)), .Names = c("productId", 
"customerID", "transactionDate", "purchaseQty"), row.names = c("1:", 
"2:", "3:", "4:", "5:", "6:"), class = "data.frame")

This also works and could be considered simpler. It does not require sorted input, and it has fewer dependencies.

I still don't understand why it produces two transactionDate columns in the output. This seems to be a byproduct of the on clause: the output appears to contain every column named in the on clause, in order and without the alias names, followed by the sum.

DT[.(p=productId, c=customerID, tmin=transactionDate - 45, tmax=transactionDate),
    on = .(productId==p, customerID==c, transactionDate<=tmax, transactionDate>=tmin),
    .(windowSum = sum(purchaseQty)), by = .EACHI, nomatch = 0]
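For anyone who wants to run the join end to end, here is a self-contained sketch that rebuilds the sample data as DT and attaches the sums back to it. As far as I understand, in a non-equi join data.table labels each join column in the result with the x-side name but fills it with the i-side value, which would explain the two transactionDate columns (they would hold tmax and tmin). The sketch sidesteps that by using only the windowSum column. Copying it back by position is an assumption that holds here because by = .EACHI returns one row per row of i, in order, and every row matches at least itself.

```r
library(data.table)

# Sample data from the first answer, rebuilt as a data.table
DT <- data.table(
  productId = 870826L,
  customerID = c(1186951L, 1244216L, 1244216L, 1308671L, 1308671L, 1308671L),
  transactionDate = as.Date(c("2016-03-28", "2016-03-31", "2016-04-08",
                              "2016-03-28", "2016-03-29", "2016-11-29")),
  purchaseQty = c(162000L, 5000L, 6500L, 221367L, 83633L, 60500L)
)

# Non-equi self-join: for every row, sum purchaseQty over the rows of the
# same product and customer dated within [transactionDate - 45, transactionDate]
res <- DT[.(p = productId, c = customerID,
            tmin = transactionDate - 45, tmax = transactionDate),
          on = .(productId == p, customerID == c,
                 transactionDate <= tmax, transactionDate >= tmin),
          .(windowSum = sum(purchaseQty)), by = .EACHI, nomatch = 0]

# Attach the sums to the original table by position (see assumption above)
DT[, windowSum := res$windowSum]
DT
```

Unlike the findInterval version, this window includes a transaction dated exactly 45 days earlier, since the lower bound uses >=.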