R: Compute a rolling sum on irregular time series grouped by id variables with time-based window

Submitted by 白昼怎懂夜的黑 on 2019-12-09 11:47:40


I love R but some problems are just plain hard.

The challenge is to find, for each person in an irregular time series, the first interval of at least 6 hours over which the rolling sum of Value is less than 30. Here is a sample of the series:

Row  Person  DateTime             Value
1    A       2014-01-01 08:15:00  5
2    A       2014-01-01 09:15:00  5
3    A       2014-01-01 10:00:00  5
4    A       2014-01-01 11:15:00  5
5    A       2014-01-01 14:15:00  5
6    B       2014-01-01 08:15:00  25
7    B       2014-01-01 10:15:00  25
8    B       2014-01-01 19:15:00  2
9    C       2014-01-01 08:00:00  20
10   C       2014-01-01 09:00:00  5
11   C       2014-01-01 13:45:00  1
12   D       2014-01-01 07:00:00  1
13   D       2014-01-01 08:15:00  13
14   D       2014-01-01 14:15:00  15

For Person A, Rows 1 & 5 create a minimum 6 hour interval with a running sum of 25 (which is less than 30).
For Person B, Rows 7 & 8 create a 9 hour interval with a running sum of 27 (again less than 30).
For Person C, using Rows 9 & 11, there is no minimum 6 hour interval (it is only 5.75 hours) although the running sum is 26 and is less than 30.
For Person D, using Rows 12 & 14, the interval is 7.25 hours but the running sum is 30 and is not less than 30.

Given n observations, there are n*(n-1)/2 intervals that must be compared. For example, with n=2 there is just 1 interval to evaluate. For n=3 there are 3 intervals. And so on.
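
To make the count concrete, here is a small sketch in base R. `choose(n, 2)` gives the same number as n*(n-1)/2, and `combn()` lists the candidate row pairs themselves. The variable names are illustrative only.

```r
# Number of candidate intervals (unordered pairs of rows) for n observations
n <- 1:5
n_intervals <- choose(n, 2)          # equals n * (n - 1) / 2
stopifnot(all(n_intervals == n * (n - 1) / 2))

# Enumerate the candidate (start row, end row) pairs for a person with 5 rows
pairs <- t(combn(5, 2))
colnames(pairs) <- c("start_row", "end_row")
nrow(pairs)                          # 10 pairs for n = 5
```

The quadratic growth is per person, so grouping by Person first keeps the number of pairs to check much smaller than it would be over the whole table.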

I assume that this is a variation of the subset sum problem (http://en.wikipedia.org/wiki/Subset_sum_problem).

While the data can be sorted, I suspect this requires a brute force solution that tests each interval.

Any help would be appreciated.

Edit: here's the data with DateTime column formatted as POSIXct:

df <- structure(list(Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"), 
DateTime = structure(c(1388560500, 1388564100, 1388566800, 
1388571300, 1388582100, 1388560500, 1388567700, 1388600100, 
1388559600, 1388563200, 1388580300, 1388556000, 1388560500, 
1388582100), class = c("POSIXct", "POSIXt"), tzone = ""), 
Value = c(5L, 5L, 5L, 5L, 5L, 25L, 25L, 2L, 20L, 5L, 1L, 
1L, 13L, 15L)), .Names = c("Person", "DateTime", "Value"), row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14"), class = "data.frame")


I have found this to be a difficult problem in R as well. So I made a package for it!


Of course, you will have to figure out your units correctly for the upper bound.
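
On the units point, a minimal sketch in base R may help. It assumes the bound is expressed in seconds, which is how POSIXct values are stored internally; check the package documentation for the unit it actually expects. The variable names are illustrative.

```r
# A 6 hour window expressed in seconds
window_hours <- 6
window_secs <- as.numeric(as.difftime(window_hours, units = "hours"), units = "secs")
window_secs                                   # 21600

# Subtracting POSIXct values picks its units automatically, so ask for them explicitly
t1 <- as.POSIXct("2014-01-01 08:15:00", tz = "UTC")
t2 <- as.POSIXct("2014-01-01 14:15:00", tz = "UTC")
elapsed_secs <- as.numeric(difftime(t2, t1, units = "secs"))
elapsed_secs >= window_secs                   # TRUE: exactly 6 hours apart
```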

Here is some more documentation if you are interested. https://github.com/mgahan/boRingTrees

For the data df that @beginneR provided, you could use the following code to get a 6 hour rolling sum.

df[ , roll := rollingByCalcs(df,dates="DateTime",target="Value",

    Person            DateTime Value roll
 1:      A 2014-01-01 01:15:00     5    5
 2:      A 2014-01-01 02:15:00     5   10
 3:      A 2014-01-01 03:00:00     5   15
 4:      A 2014-01-01 04:15:00     5   20
 5:      A 2014-01-01 07:15:00     5   25
 6:      B 2014-01-01 01:15:00    25   25
 7:      B 2014-01-01 03:15:00    25   50
 8:      B 2014-01-01 12:15:00     2    2
 9:      C 2014-01-01 01:00:00    20   20
10:      C 2014-01-01 02:00:00     5   25
11:      C 2014-01-01 06:45:00     1   26
12:      D 2014-01-01 00:00:00     1    1
13:      D 2014-01-01 01:15:00    13   14
14:      D 2014-01-01 07:15:00    15   28

The original post is pretty unclear to me, so this might not be exactly what was wanted. If a column with the desired output were provided, I imagine I could be of more help.


We assume that an interval is defined by two rows for the same person. For each person, we want the first such interval (time-wise) of at least 6 hours for which the sum of Value of those two rows and any intermediate rows is less than 30. If there is more than one such first interval for a person, pick one arbitrarily.

This can be represented by a triple self-join in SQL. The inner select joins each interval start (a.DateTime) and interval end (b.DateTime) with every row between them (c.DateTime), groups by Person and interval, and sums Value over each interval that spans at least 6 hours. The outer select then keeps only the intervals whose total is < 30 and, for each Person, keeps the one whose start DateTime is least. If there is more than one such first interval for a Person, it picks one arbitrarily.


library(sqldf)  # assumed wrapper: sqldf runs the SQL string against the data frame DF

sqldf("select Person, min(Datetime) DateTime, hours, total 
       from (select a.Person, 
           (b.Datetime - a.DateTime)/3600 hours, 
           sum(c.Value) total
           from DF a join DF b join DF c
           on a.Person = b.Person and a.Person = c.Person and hours >= 6
           and c.DateTime between a.DateTime and b.DateTime
           group by a.Person, a.DateTime, b.DateTime)
       where total < 30
       group by Person")


  Person            DateTime hours total
1      A 2014-01-01 08:15:00  6.00    25
2      B 2014-01-01 10:15:00  9.00    27
3      D 2014-01-01 07:00:00  7.25    29

Note: We used this data:

DF <- data.frame( Row = 1:14,
  Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 
             4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
  DateTime = structure(c(1388582100, 1388585700, 1388588400, 1388592900, 
             1388603700, 1388582100, 1388589300, 1388621700, 1388581200, 
             1388584800, 1388601900, 1388577600, 1388582100, 1388603700), 
             class = c("POSIXct", "POSIXt"), tzone = ""),
  Value = c(5L, 5L, 5L, 5L, 5L, 25L, 25L, 2L, 20L, 5L, 1L, 1L, 13L, 15L) ) 
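
As a cross-check of the result above, the same definition (two rows of one person at least 6 hours apart, with Value summed over both rows and everything between them) can be sketched as a brute-force loop in base R. This is not part of the original answer: `first_interval` and `sample_df` are hypothetical names, and the timestamps are the clock times from the question with `tz = "UTC"` assumed so the example is reproducible.

```r
# Sample data from the question; tz = "UTC" is an assumption made for reproducibility
sample_df <- data.frame(
  Person = rep(c("A", "B", "C", "D"), times = c(5, 3, 3, 3)),
  DateTime = as.POSIXct(paste("2014-01-01", c(
    "08:15:00", "09:15:00", "10:00:00", "11:15:00", "14:15:00",
    "08:15:00", "10:15:00", "19:15:00",
    "08:00:00", "09:00:00", "13:45:00",
    "07:00:00", "08:15:00", "14:15:00")), tz = "UTC"),
  Value = c(5, 5, 5, 5, 5, 25, 25, 2, 20, 5, 1, 1, 13, 15)
)

# Hypothetical helper: scan all row pairs of one person, earliest start first,
# and return the first interval that is long enough and has a small enough total
first_interval <- function(d, min_hours = 6, max_total = 30) {
  d <- d[order(d$DateTime), ]
  n <- nrow(d)
  for (i in seq_len(max(n - 1, 0))) {
    for (j in (i + 1):n) {
      hours <- as.numeric(difftime(d$DateTime[j], d$DateTime[i], units = "hours"))
      total <- sum(d$Value[i:j])   # both endpoints plus any intermediate rows
      if (hours >= min_hours && total < max_total) {
        return(data.frame(Person = d$Person[1], DateTime = d$DateTime[i],
                          hours = hours, total = total))
      }
    }
  }
  NULL                             # no qualifying interval for this person
}

per_person <- lapply(split(sample_df, sample_df$Person), first_interval)
result <- do.call(rbind, Filter(Negate(is.null), per_person))
print(result)
```

With the sample data this yields one row each for Persons A, B and D, with totals 25, 27 and 29, and nothing for Person C, whose longest interval is only 5.75 hours.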


As of version 1.9.8 (on CRAN 25 Nov 2016), the data.table package has gained the ability to aggregate in a non-equi join.

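library(data.table)

# all start/end pairs per person, keeping intervals of at least 6 hours
tmp <- setDT(df)[, CJ(start = DateTime, end = DateTime)[
  , hours := difftime(end, start, units = "hours")][hours >= 6], by = Person]
# non-equi join: sum Value over each interval, then take the first total < 30 per person
df[tmp, on = .(Person, DateTime >= start, DateTime <= end), 
  .(hours, total = sum(Value)), by = .EACHI][
    total < 30, .SD[1L], by = Person]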
   Person            DateTime      hours total
1:      A 2014-01-01 08:15:00 6.00 hours    25
2:      B 2014-01-01 10:15:00 9.00 hours    27
3:      D 2014-01-01 07:00:00 7.25 hours    29

tmp contains all possible intervals of 6 or more hours for each person. It is created by a cross join with CJ() followed by filtering:

   Person               start                 end       hours
1:      A 2014-01-01 08:15:00 2014-01-01 14:15:00  6.00 hours
2:      B 2014-01-01 08:15:00 2014-01-01 19:15:00 11.00 hours
3:      B 2014-01-01 10:15:00 2014-01-01 19:15:00  9.00 hours
4:      D 2014-01-01 07:00:00 2014-01-01 14:15:00  7.25 hours
5:      D 2014-01-01 08:15:00 2014-01-01 14:15:00  6.00 hours

The non-equi join then aggregates Value over each of these intervals. Finally, the result is filtered for a total of less than 30, and the first occurrence for each person is picked.
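
For readers without data.table at hand, the same three steps (cross join, filter on duration, aggregate over each interval) can be illustrated in base R for a single person. This is only a sketch of the idea, not the answer's implementation: it uses Person B's rows, assumes `tz = "UTC"` for reproducibility, and the names `b`, `idx`, `iv` and `first_hit` are made up for the example.

```r
# Person B's rows from the sample data; tz = "UTC" is an assumption for reproducibility
b <- data.frame(
  DateTime = as.POSIXct(c("2014-01-01 08:15:00", "2014-01-01 10:15:00",
                          "2014-01-01 19:15:00"), tz = "UTC"),
  Value = c(25, 25, 2)
)

# Cross join of row indices; end_row varies fastest, so rows are ordered by start, then end
idx <- expand.grid(end_row = seq_len(nrow(b)), start_row = seq_len(nrow(b)))
iv <- data.frame(start = b$DateTime[idx$start_row], end = b$DateTime[idx$end_row])
iv$hours <- as.numeric(difftime(iv$end, iv$start, units = "hours"))

# Keep only intervals of at least 6 hours
iv <- iv[iv$hours >= 6, ]

# Aggregate Value over the rows that fall inside each interval (bounds included)
iv$total <- vapply(seq_len(nrow(iv)), function(k) {
  sum(b$Value[b$DateTime >= iv$start[k] & b$DateTime <= iv$end[k]])
}, numeric(1))

# First interval whose total is below 30
first_hit <- iv[iv$total < 30, ][1, ]
print(first_hit)
```

The two surviving intervals are the 11 hour one (total 52) and the 9 hour one (total 27), so the 9 hour interval starting at 10:15 is the one picked for Person B.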

