Question
I have a data.table (~30 million rows) consisting of a datetime column in POSIXct format, an id column, and a few other columns (in the example, I just left one irrelevant column x to demonstrate that there are other columns present that need to be kept). A dput is at the bottom of the post.
head(DT)
# datetime x id
#1: 2016-04-28 16:20:18 0.02461368 1
#2: 2016-04-28 16:41:34 0.88953932 1
#3: 2016-04-28 16:46:07 0.31818101 1
#4: 2016-04-28 17:00:56 0.14711365 1
#5: 2016-04-28 17:09:11 0.54406602 1
#6: 2016-04-28 17:39:09 0.69280341 1
Q: For each id, I need to subset only those observations that differ by more than 30 minutes in time. What could be an efficient data.table approach to do this (if possible, without extensive looping)?
The logic can also be described as (like in my comment below):
Per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on
In the dput below, I added a column keep to indicate which rows should be kept in this example because they differ by more than 30 minutes from the previous kept observation per id. The difficulty is that the time differences seem to have to be calculated iteratively (or at least, I cannot think of a more efficient approach at the moment).
library(data.table)
DT <- structure(list(
datetime = structure(c(1461853218.81561, 1461854494.81561,
1461854767.81561, 1461855656.81561, 1461856151.81561, 1461857949.81561,
1461858601.81561, 1461858706.81561, 1461859078.81561, 1461859103.81561,
1461852799.81561, 1461852824.81561, 1461854204.81561, 1461855331.81561,
1461855633.81561, 1461856311.81561, 1461856454.81561, 1461857177.81561,
1461858662.81561, 1461858996.81561), class = c("POSIXct", "POSIXt")),
x = c(0.0246136845089495, 0.889539316063747, 0.318181007634848,
0.147113647311926, 0.544066024711356, 0.6928034061566, 0.994269776623696,
0.477795971091837, 0.231625785352662, 0.963024232536554, 0.216407935833558,
0.708530468167737, 0.758459537522867, 0.640506813768297, 0.902299045119435,
0.28915973729454, 0.795467417687178, 0.690705278422683, 0.59414202044718,
0.655705799115822),
id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
keep = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)),
.Names = c("datetime", "x", "id", "keep"),
row.names = c(NA, -20L),
class = c("data.table", "data.frame"))
setkey(DT, id, datetime)
# time difference (in minutes) to the previous observation within each id
DT[, difftime := difftime(datetime, shift(datetime, 1L, NA, type = "lag"), units = "mins"),
   by = id]
DT[is.na(difftime), difftime := 0]
# cumulative minutes since the first observation within each id
DT[, difftime := cumsum(as.numeric(difftime)), by = id]
Explanation of the keep column:
- Rows 2:3 differ by less than 30 minutes from row 1 -> delete
- Row 4 differs by more than 30 minutes from row 1 -> keep
- Row 5 differs by less than 30 minutes from row 4 -> delete
- Row 6 differs by more than 30 minutes from row 4 -> keep
- ...
Desired output:
desiredDT <- DT[(keep)]
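For reference, here is a minimal (and deliberately slow) base R sketch of that iterative logic; keep_iter is an illustrative helper name, not part of any answer, and DT is assumed to be keyed by id and datetime as above. It is mainly useful for validating faster solutions on small data:
keep_iter <- function(times, gap = 30 * 60) {
  # times: sorted POSIXct vector; gap: threshold in seconds
  res <- logical(length(times))
  res[1] <- TRUE
  last <- as.numeric(times[1])       # time of the last kept observation
  for (i in seq_along(times)[-1]) {
    if (as.numeric(times[i]) - last > gap) {
      res[i] <- TRUE
      last <- as.numeric(times[i])
    }
  }
  res
}
DT[, keep.ref := keep_iter(datetime), by = id]
DT[, all(keep == keep.ref)]  # should be TRUE for the example data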
Thanks for the three expert answers I received. I tested them on 1 and 10 million rows of data. Here's an excerpt of the benchmarks.
a) 1 million rows
microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
times = 3L, unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval
# frank(DT_Frank) 1.286647 1.277104 1.185216 1.267769 1.140614 1.036749 3
# roland(DT_Roland) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
# eddi1(DT_Eddi1) 11.748622 11.697409 10.941792 11.647320 10.587002 9.720901 3
# eddi2(DT_Eddi2) 9.966078 9.915651 9.210168 9.866330 8.877769 8.070281 3
b) 10 million rows
microbenchmark(frank(DT_Frank), roland(DT_Roland), eddi1(DT_Eddi1), eddi2(DT_Eddi2),
times = 3L, unit = "relative")
#Unit: relative
# expr min lq mean median uq max neval
# frank(DT_Frank) 1.019561 1.025427 1.026681 1.031061 1.030028 1.029037 3
# roland(DT_Roland) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3
# eddi1(DT_Eddi1) 11.567302 11.443146 11.301487 11.323914 11.176515 11.035143 3
# eddi2(DT_Eddi2) 9.796800 9.693823 9.526193 9.594931 9.398969 9.211019 3
Apparently, @Frank's data.table approach and @Roland's Rcpp-based solution are similar in performance, with Rcpp having a slight advantage, while @eddi's approaches were still fast but not as performant as the others.
However, when I checked for equality of the solutions, I found that @Roland's approach has a slightly different result than the others:
a) 1 million rows
all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (982228, 982224) differ"
#[2] "Component “id”: Numeric: lengths (982228, 982224) differ"
#[3] "Component “x”: Numeric: lengths (982228, 982224) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE
b) 10 million rows
all.equal(frank(DT_Frank), roland(DT_Roland))
#[1] "Component “datetime”: Numeric: lengths (9981898, 9981891) differ"
#[2] "Component “id”: Numeric: lengths (9981898, 9981891) differ"
#[3] "Component “x”: Numeric: lengths (9981898, 9981891) differ"
all.equal(frank(DT_Frank), eddi1(DT_Eddi1))
#[1] TRUE
all.equal(frank(DT_Frank), eddi2(DT_Eddi2))
#[1] TRUE
My current assumption is that this difference might be related to whether the comparison is > 30 minutes or >= 30 minutes (the Rcpp solution keeps a row only if it is strictly more than 30 minutes after the last kept one, while the rolling joins also match rows that are exactly 30 minutes later), though I'm not entirely sure about that yet.
Final thought: I decided to go with @Frank's solution for two reasons: 1. it performs very well, almost on par with the Rcpp solution, and 2. it doesn't require another package I'm not yet very familiar with (I'm using data.table anyway).
Answer 1:
Here's what I would do:
setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data
s = 0L
# start with the first row of each id (always kept)
w = DT[, .I[1L], by=id]$V1
while (length(w)){
  s = s + 1L
  DT[w, tag := s]   # tag the rows kept in this pass
  # for each kept row, find the first row of the same id at or after
  # "kept datetime + 30 minutes" via a backward rolling join
  m = DT[w, .(id, datetime = datetime+30*60)]
  w = DT[m, which = TRUE, roll=-Inf]
  w = w[!is.na(w)]  # ids with no such later row drop out
}
which gives
datetime x id keep tag
1: 2016-04-28 10:20:18 0.02461368 1 TRUE 1
2: 2016-04-28 10:41:34 0.88953932 1 FALSE NA
3: 2016-04-28 10:46:07 0.31818101 1 FALSE NA
4: 2016-04-28 11:00:56 0.14711365 1 TRUE 2
5: 2016-04-28 11:09:11 0.54406602 1 FALSE NA
6: 2016-04-28 11:39:09 0.69280341 1 TRUE 3
7: 2016-04-28 11:50:01 0.99426978 1 FALSE NA
8: 2016-04-28 11:51:46 0.47779597 1 FALSE NA
9: 2016-04-28 11:57:58 0.23162579 1 FALSE NA
10: 2016-04-28 11:58:23 0.96302423 1 FALSE NA
11: 2016-04-28 10:13:19 0.21640794 2 TRUE 1
12: 2016-04-28 10:13:44 0.70853047 2 FALSE NA
13: 2016-04-28 10:36:44 0.75845954 2 FALSE NA
14: 2016-04-28 10:55:31 0.64050681 2 TRUE 2
15: 2016-04-28 11:00:33 0.90229905 2 FALSE NA
16: 2016-04-28 11:11:51 0.28915974 2 FALSE NA
17: 2016-04-28 11:14:14 0.79546742 2 FALSE NA
18: 2016-04-28 11:26:17 0.69070528 2 TRUE 3
19: 2016-04-28 11:51:02 0.59414202 2 FALSE NA
20: 2016-04-28 11:56:36 0.65570580 2 TRUE 4
The idea behind it is described by the OP in a comment:
per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on
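If only the subset itself is needed rather than the tag column, the kept rows can then be extracted from the tag, for example:
res <- DT[!is.na(tag)]  # the tagged rows are exactly the ones to keep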
Answer 2:
Using Rcpp:
library(Rcpp)
library(inline)
cppFunction(
'LogicalVector selecttimes(const NumericVector x) {
  const int n = x.length();
  LogicalVector res(n);
  res(0) = true;           // the first observation is always kept
  double testval = x(0);   // time (in seconds) of the last kept observation
  for (int i=1; i<n; i++) {
    if (x(i) - testval > 30 * 60) {  // strictly more than 30 minutes later
      testval = x(i);
      res(i) = true;
    }
  }
  return res;
}')
DT[, keep1 := selecttimes(datetime), by = id]
DT[, all(keep == keep1)]
#[1] TRUE
Some additional testing should be done; it also needs input validation, and the time difference could be made a parameter.
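For instance, a minimal sketch of that parameterization could look like the following (selecttimes2 and the gap argument, given in seconds, are illustrative names, not part of the original answer):
cppFunction(
'LogicalVector selecttimes2(const NumericVector x, const double gap) {
  const int n = x.length();
  LogicalVector res(n);
  res(0) = true;
  double testval = x(0);
  for (int i = 1; i < n; i++) {
    if (x(i) - testval > gap) {
      testval = x(i);
      res(i) = true;
    }
  }
  return res;
}')
DT[, keep2 := selecttimes2(datetime, 30 * 60), by = id]  # 30-minute threshold, as before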
Answer 3:
# create an index column
DT[, idx := 1:.N, by = id]
# find the indices of the matching future dates
DT[, fut.idx := DT[.(id = id, datetime = datetime+30*60), on = c('id', 'datetime')
, idx, roll = -Inf]]
# datetime x id keep difftime idx fut.idx
# 1: 2016-04-28 09:20:18 0.02461368 1 TRUE 0.0000000 mins 1 4
# 2: 2016-04-28 09:41:34 0.88953932 1 FALSE 21.2666667 mins 2 6
# 3: 2016-04-28 09:46:07 0.31818101 1 FALSE 25.8166667 mins 3 6
# 4: 2016-04-28 10:00:56 0.14711365 1 TRUE 40.6333333 mins 4 6
# 5: 2016-04-28 10:09:11 0.54406602 1 FALSE 48.8833333 mins 5 7
# 6: 2016-04-28 10:39:09 0.69280341 1 TRUE 78.8500000 mins 6 NA
# 7: 2016-04-28 10:50:01 0.99426978 1 FALSE 89.7166667 mins 7 NA
# 8: 2016-04-28 10:51:46 0.47779597 1 FALSE 91.4666667 mins 8 NA
# 9: 2016-04-28 10:57:58 0.23162579 1 FALSE 97.6666667 mins 9 NA
#10: 2016-04-28 10:58:23 0.96302423 1 FALSE 98.0833333 mins 10 NA
#11: 2016-04-28 09:13:19 0.21640794 2 TRUE 0.0000000 mins 1 4
#12: 2016-04-28 09:13:44 0.70853047 2 FALSE 0.4166667 mins 2 4
#13: 2016-04-28 09:36:44 0.75845954 2 FALSE 23.4166667 mins 3 6
#14: 2016-04-28 09:55:31 0.64050681 2 TRUE 42.2000000 mins 4 8
#15: 2016-04-28 10:00:33 0.90229905 2 FALSE 47.2333333 mins 5 9
#16: 2016-04-28 10:11:51 0.28915974 2 FALSE 58.5333333 mins 6 9
#17: 2016-04-28 10:14:14 0.79546742 2 FALSE 60.9166667 mins 7 9
#18: 2016-04-28 10:26:17 0.69070528 2 TRUE 72.9666667 mins 8 10
#19: 2016-04-28 10:51:02 0.59414202 2 FALSE 97.7166667 mins 9 NA
#20: 2016-04-28 10:56:36 0.65570580 2 TRUE 103.2833333 mins 10 NA
# at this point the problem is "solved", but you still have to extract the solution
# and that's the more complicated part
DT[, keep.new := FALSE]
# iterate over the matching indices (jumping straight to the correct one)
DT[, {
next.idx = 1
while(!is.na(next.idx)) {
set(DT, .I[next.idx], 'keep.new', TRUE)
next.idx = fut.idx[next.idx]
}
}, by = id]
DT[, identical(keep, keep.new)]
#[1] TRUE
Alternatively for the last step, you can do (this will iterate over the entire thing, but I don't know what the speed impact would be):
DT[, keep.3 := FALSE]
DT[DT[, .I[na.omit(Reduce(function(x, y) fut.idx[x], c(1, fut.idx), accumulate = T))]
, by = id]$V1
, keep.3 := TRUE]
DT[, identical(keep, keep.3)]
#[1] TRUE
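In both variants, the final subset is then, as in the question's desired output, a simple filter on the flag column:
DT[(keep.new)]  # or DT[(keep.3)] for the alternative last step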
Source: https://stackoverflow.com/questions/36918158/subset-observations-that-differ-by-at-least-30-minutes-time