Question
I love R but some problems are just plain hard.
The challenge is to find, for each person in an irregular time series, the first time-based window of at least 6 hours whose rolling sum is less than 30. Here is a sample of the series:
Row Person DateTime Value
1 A 2014-01-01 08:15:00 5
2 A 2014-01-01 09:15:00 5
3 A 2014-01-01 10:00:00 5
4 A 2014-01-01 11:15:00 5
5 A 2014-01-01 14:15:00 5
6 B 2014-01-01 08:15:00 25
7 B 2014-01-01 10:15:00 25
8 B 2014-01-01 19:15:00 2
9 C 2014-01-01 08:00:00 20
10 C 2014-01-01 09:00:00 5
11 C 2014-01-01 13:45:00 1
12 D 2014-01-01 07:00:00 1
13 D 2014-01-01 08:15:00 13
14 D 2014-01-01 14:15:00 15
For Person A, Rows 1 & 5 create a minimum 6 hour interval with a running sum of 25 (which is less than 30).
For Person B, Rows 7 & 8 create a 9 hour interval with a running sum of 27 (again less than 30).
For Person C, using Rows 9 & 11, there is no minimum 6 hour interval (it is only 5.75 hours) although the running sum is 26 and is less than 30.
For Person D, using Rows 12 & 14, the interval is 7.25 hours but the running sum is 30 and is not less than 30.
Given n observations, there are n*(n-1)/2 intervals that must be compared. For example, with n=2 there is just 1 interval to evaluate. For n=3 there are 3 intervals. And so on.
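The pair count can be checked directly in R. This is a small sketch, not part of the original question: it counts the start/end pairs with choose() and enumerates them with combn() for Person A's timestamps, which are written as decimal hours here purely for illustration.

```r
# Number of start/end pairs among n observations: n * (n - 1) / 2
n <- 5
n_pairs <- choose(n, 2)

# Person A's timestamps from the sample, as decimal hours (illustrative encoding)
hours_A <- c(8.25, 9.25, 10, 11.25, 14.25)

# One row per (start, end) index pair
pairs <- t(combn(length(hours_A), 2))
spans <- hours_A[pairs[, 2]] - hours_A[pairs[, 1]]

# Only pairs spanning at least 6 hours are candidate windows
candidates <- pairs[spans >= 6, , drop = FALSE]
candidates
```

For Person A only the pair of Rows 1 & 5 spans 6 hours or more, so the time filter removes most pairs before any sum has to be computed.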
I assume that this is a variation of the subset sum problem (http://en.wikipedia.org/wiki/Subset_sum_problem)
While the data can be sorted, I suspect this requires a brute force solution testing each interval.
Any help would be appreciated.
Edit: here's the data with DateTime column formatted as POSIXct:
df <- structure(list(Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
DateTime = structure(c(1388560500, 1388564100, 1388566800,
1388571300, 1388582100, 1388560500, 1388567700, 1388600100,
1388559600, 1388563200, 1388580300, 1388556000, 1388560500,
1388582100), class = c("POSIXct", "POSIXt"), tzone = ""),
Value = c(5L, 5L, 5L, 5L, 5L, 25L, 25L, 2L, 20L, 5L, 1L,
1L, 13L, 15L)), .Names = c("Person", "DateTime", "Value"), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14"), class = "data.frame")
Answer 1:
I have found this to be a difficult problem in R as well. So I made a package for it!
library("devtools")
# current devtools expects the repository as "user/repo"
install_github("mgahan/boRingTrees")
library(boRingTrees)
Of course, you will have to figure out your units correctly for the upper bound.
Here is some more documentation if you are interested. https://github.com/mgahan/boRingTrees
For the data df that @beginneR provided, you could use the following code to get a 6 hour rolling sum.
require(data.table)
setDT(df)
df[ , roll := rollingByCalcs(df,dates="DateTime",target="Value",
by="Person",stat=sum,lower=0,upper=6*60*60)]
Person DateTime Value roll
1: A 2014-01-01 01:15:00 5 5
2: A 2014-01-01 02:15:00 5 10
3: A 2014-01-01 03:00:00 5 15
4: A 2014-01-01 04:15:00 5 20
5: A 2014-01-01 07:15:00 5 25
6: B 2014-01-01 01:15:00 25 25
7: B 2014-01-01 03:15:00 25 50
8: B 2014-01-01 12:15:00 2 2
9: C 2014-01-01 01:00:00 20 20
10: C 2014-01-01 02:00:00 5 25
11: C 2014-01-01 06:45:00 1 26
12: D 2014-01-01 00:00:00 1 1
13: D 2014-01-01 01:15:00 13 14
14: D 2014-01-01 07:15:00 15 28
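The roll column above can be cross-checked without the package. The sketch below is an assumption about what the output implies, not the package's actual implementation: for each row it sums every Value whose timestamp falls within the preceding 6 hours, with both ends of the window inclusive. The helper name trailing_sum is hypothetical, and the example uses Person B's rows with times rebuilt in UTC.

```r
# Hypothetical helper: trailing time-window sum for one person's rows.
# Both window bounds are inclusive, which is what the output above suggests.
trailing_sum <- function(time, value, window_secs = 6 * 60 * 60) {
  secs <- as.numeric(time)  # seconds since the epoch
  vapply(seq_along(secs), function(i) {
    in_window <- secs <= secs[i] & secs >= secs[i] - window_secs
    sum(value[in_window])
  }, numeric(1))
}

# Person B from the sample data
time_B <- as.POSIXct(c("2014-01-01 08:15", "2014-01-01 10:15", "2014-01-01 19:15"),
                     format = "%Y-%m-%d %H:%M", tz = "UTC")
value_B <- c(25, 25, 2)

roll_B <- trailing_sum(time_B, value_B)
roll_B
```

For Person B this gives 25, 50 and 2, matching rows 6 to 8 of the table. Note that a trailing sum answers a different question from the one asked: it fixes the window length at 6 hours rather than looking for windows of at least 6 hours.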
The original post is pretty unclear to me, so this might not be exactly what the asker wanted. If a column with the desired output were provided, I imagine I could be of more help.
Answer 2:
We assume that an interval is defined by two rows for the same person. For each person, we want the first such interval (time-wise) of at least 6 hours for which the sum of Value of those two rows and any intermediate rows is less than 30. If there is more than one such first interval for a person, pick one arbitrarily.
This can be represented by a triple join in SQL. The inner select picks out all rows consisting of the start of the interval (a.DateTime), the end of the interval (b.DateTime) and the rows between them (c.DateTime), grouping by Person and interval and summing over the Value, provided the interval spans at least 6 hours. The outer select then keeps only those rows whose total is < 30 and, for each Person, keeps only the one whose DateTime is least. If there is more than one first row (time-wise) for a Person, it picks one arbitrarily.
library(sqldf)
sqldf(
"select Person, min(Datetime) DateTime, hours, total
from (select a.Person,
a.DateTime,
(b.Datetime - a.DateTime)/3600 hours,
sum(c.Value) total
from DF a join DF b join DF c
on a.Person = b.Person and a.Person = c.Person and hours >= 6
and c.DateTime between a.DateTime and b.DateTime
group by a.Person, a.DateTime, b.DateTime)
where total < 30
group by Person"
)
giving:
Person DateTime hours total
1 A 2014-01-01 08:15:00 6.00 25
2 B 2014-01-01 10:15:00 9.00 27
3 D 2014-01-01 07:00:00 7.25 29
Note: We used this data:
DF <- data.frame( Row = 1:14,
Person = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
DateTime = structure(c(1388582100, 1388585700, 1388588400, 1388592900,
1388603700, 1388582100, 1388589300, 1388621700, 1388581200,
1388584800, 1388601900, 1388577600, 1388582100, 1388603700),
class = c("POSIXct", "POSIXt"), tzone = ""),
Value = c(5L, 5L, 5L, 5L, 5L, 25L, 25L, 2L, 20L, 5L, 1L, 1L, 13L, 15L) )
Answer 3:
As of version 1.9.8 (on CRAN 25 Nov 2016), the data.table package has gained the ability to aggregate in a non-equi join.
library(data.table)
tmp <- setDT(df)[, CJ(start = DateTime, end = DateTime)[
, hours := difftime(end, start, units = "hours")][hours >= 6], by = Person]
df[tmp, on = .(Person, DateTime >= start, DateTime <= end),
.(hours, total = sum(Value)), by = .EACHI][
total < 30, .SD[1L], by = Person]
   Person            DateTime      hours total
1:      A 2014-01-01 08:15:00 6.00 hours    25
2:      B 2014-01-01 10:15:00 9.00 hours    27
3:      D 2014-01-01 07:00:00 7.25 hours    29
tmp contains all possible intervals of 6 or more hours for each person. It is created through a cross join CJ() and subsequent filtering:
tmp
   Person               start                 end       hours
1:      A 2014-01-01 08:15:00 2014-01-01 14:15:00  6.00 hours
2:      B 2014-01-01 08:15:00 2014-01-01 19:15:00 11.00 hours
3:      B 2014-01-01 10:15:00 2014-01-01 19:15:00  9.00 hours
4:      D 2014-01-01 07:00:00 2014-01-01 14:15:00  7.25 hours
5:      D 2014-01-01 08:15:00 2014-01-01 14:15:00  6.00 hours
The non-equi join then aggregates Value over each of these intervals. Finally, the result is filtered for a total of less than 30, and the first occurrence for each person is picked.
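For comparison, the same logic can be written as a brute-force search in base R. This is a minimal sketch, not taken from any of the answers: the helper first_window and the data frame sample_df are hypothetical names, the timestamps are rebuilt in UTC from the table in the question, and "first" is taken to mean earliest start, then earliest end.

```r
# Hypothetical helper: first window of at least min_hours whose total
# (both end rows plus everything between them) is below max_sum.
first_window <- function(person, time, value, min_hours = 6, max_sum = 30) {
  ord <- order(time)
  time <- time[ord]
  value <- value[ord]
  secs <- as.numeric(time)  # seconds since the epoch
  n <- length(secs)
  if (n < 2) return(NULL)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      hours <- (secs[j] - secs[i]) / 3600
      total <- sum(value[i:j])
      if (hours >= min_hours && total < max_sum) {
        return(data.frame(Person = person, DateTime = time[i],
                          hours = hours, total = total))
      }
    }
  }
  NULL  # no qualifying window for this person
}

# Sample data from the question, with times fixed to UTC for reproducibility
sample_df <- data.frame(
  Person = rep(c("A", "B", "C", "D"), times = c(5, 3, 3, 3)),
  DateTime = as.POSIXct(paste("2014-01-01", c(
    "08:15", "09:15", "10:00", "11:15", "14:15",
    "08:15", "10:15", "19:15",
    "08:00", "09:00", "13:45",
    "07:00", "08:15", "14:15")), format = "%Y-%m-%d %H:%M", tz = "UTC"),
  Value = c(5, 5, 5, 5, 5, 25, 25, 2, 20, 5, 1, 1, 13, 15),
  stringsAsFactors = FALSE
)

# Apply per person, drop persons without a qualifying window, and stack
per_person <- lapply(split(sample_df, sample_df$Person), function(d)
  first_window(d$Person[1], d$DateTime, d$Value))
result <- do.call(rbind, Filter(Negate(is.null), per_person))
rownames(result) <- NULL
result
```

On the sample data this sketch returns rows for A, B and D with totals 25, 27 and 29, and nothing for C. The nested loop visits every pair per person, so it is quadratic in group size and mainly useful as a cross-check on small data.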
Source: https://stackoverflow.com/questions/25124201/r-compute-a-rolling-sum-on-irregular-time-series-grouped-by-id-variables-with-t