Question
I have generated a series of hourly time stamps with:
intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = 'America/Los_Angeles'), as.POSIXct("2018-01-20 03:00:00", tz = 'America/Los_Angeles'), by="hour")
> intervals
[1] "2018-01-20 00:00:00 PST" "2018-01-20 01:00:00 PST" "2018-01-20 02:00:00 PST"
[4] "2018-01-20 03:00:00 PST"
Given a dataset with messy and unevenly spaced timestamps, how would one match time values from that dataset to the closest hourly timestamp by id, and remove the other timestamps in between? For example:
> test
time id amount
312 2018-01-20 00:02:14 PST 1 54.9508346
8652 2018-01-20 00:54:41 PST 2 30.5557992
13809 2018-01-20 01:19:27 PST 3 90.5459248
586 2018-01-20 00:03:35 PST 1 79.7635973
9077 2018-01-20 00:56:37 PST 2 75.5356406
21546 2018-01-20 02:25:05 PST 3 36.6017705
7275 2018-01-20 00:47:45 PST 1 12.7618139
12768 2018-01-20 01:15:30 PST 2 72.4465838
1172 2018-01-20 00:08:01 PST 3 81.0468155
24106 2018-01-20 03:04:10 PST 1 0.8615881
14464 2018-01-20 01:25:04 PST 2 49.8718743
15344 2018-01-20 01:29:30 PST 3 85.0054113
14255 2018-01-20 01:23:22 PST 1 34.5093891
21565 2018-01-20 02:25:40 PST 2 69.0175725
15602 2018-01-20 01:31:32 PST 3 61.8602426
Would produce:
> output
interval id amount
1 2018-01-20 01:00:00 1 12.7618139
2 2018-01-20 1 54.9508346
3 2018-01-20 03:00:00 1 0.8615881
4 2018-01-20 01:00:00 2 75.5356400
5 2018-01-20 02:00:00 2 69.0175700
6 2018-01-20 3 81.0468200
7 2018-01-20 01:00:00 3 90.5459200
8 2018-01-20 02:00:00 3 36.6017700
I understand that a possible solution exists in data.table using roll = "nearest":
setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]
But how would one find the nearest match in intervals for every id in test and retain the amount attribute?
Any suggestions would be appreciated! Here is the sample data:
dput(test)
structure(list(time = c("2018-01-20 00:02:14 PST", "2018-01-20 00:54:41 PST",
"2018-01-20 01:19:27 PST", "2018-01-20 00:03:35 PST", "2018-01-20 00:56:37 PST",
"2018-01-20 02:25:05 PST", "2018-01-20 00:47:45 PST", "2018-01-20 01:15:30 PST",
"2018-01-20 00:08:01 PST", "2018-01-20 03:04:10 PST", "2018-01-20 01:25:04 PST",
"2018-01-20 01:29:30 PST", "2018-01-20 01:23:22 PST", "2018-01-20 02:25:40 PST",
"2018-01-20 01:31:32 PST"), id = c(1, 2, 3, 1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3), amount = c(54.9508346011862, 30.5557992309332,
90.5459248460829, 79.763597343117, 75.5356406327337, 36.6017704829574,
12.7618139144033, 72.4465838400647, 81.0468154959381, 0.861588073894382,
49.8718742514029, 85.0054113194346, 34.5093891490251, 69.0175724914297,
61.8602426256984)), .Names = c("time", "id", "amount"), row.names = c(312L,
8652L, 13809L, 586L, 9077L, 21546L, 7275L, 12768L, 1172L, 24106L,
14464L, 15344L, 14255L, 21565L, 15602L), class = "data.frame")
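Note that the time column in the dput output is character, with a trailing "PST" label. A minimal sketch of parsing such strings with an explicit time zone, so that they line up with intervals; it assumes the label can simply be ignored (strptime-style parsing drops trailing characters beyond the format), and time_chr, parsed and gap_min are names made up for illustration:

```r
# Two of the sample strings, used here only for illustration
time_chr <- c("2018-01-20 00:02:14 PST", "2018-01-20 03:04:10 PST")

# Parse with an explicit format and time zone; the trailing " PST" is ignored
parsed <- as.POSIXct(time_chr, format = "%Y-%m-%d %H:%M:%S",
                     tz = "America/Los_Angeles")

# The same hourly grid as in the question
intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = "America/Los_Angeles"),
                 as.POSIXct("2018-01-20 03:00:00", tz = "America/Los_Angeles"),
                 by = "hour")

# Distance in minutes from the first and last grid points
gap_min <- as.numeric(difftime(parsed, intervals[c(1, 4)], units = "mins"))
```

Setting tz explicitly avoids depending on the session's local time zone, which is what a bare as.POSIXct(time) would use.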
Answer 1:
Another option is to join inside j with data.table:
# convert 'test' to a 'data.table' first with 'setDT'
# and convert the 'time' column to a datetime format
setDT(test)[, time := as.POSIXct(time)][]
# perform the join
test[, .SD[.(time = intervals), on = .(time), roll = 'nearest'], by = id]
which gives:
    id                time     amount
 1:  1 2018-01-20 00:00:00 54.9508346
 2:  1 2018-01-20 01:00:00 12.7618139
 3:  1 2018-01-20 02:00:00 34.5093891
 4:  1 2018-01-20 03:00:00  0.8615881
 5:  2 2018-01-20 00:00:00 30.5557992
 6:  2 2018-01-20 01:00:00 75.5356406
 7:  2 2018-01-20 02:00:00 69.0175725
 8:  2 2018-01-20 03:00:00 69.0175725
 9:  3 2018-01-20 00:00:00 81.0468155
10:  3 2018-01-20 01:00:00 90.5459248
11:  3 2018-01-20 02:00:00 36.6017705
12:  3 2018-01-20 03:00:00 36.6017705
In the above approach, some amount values are assigned to more than one time by id. If you don't want that, and only want to keep the ones that are closest to a time, you could refine the approach as follows:
test[, r := rowid(id)
][, .SD[.(time = intervals)
, on = .(time)
, roll = 'nearest'
, .(time, amount, r, time_diff = abs(x.time - i.time))
][, .SD[which.min(time_diff)], by = r]
, by = id][, c('r','time_diff') := NULL][]
which gives:
    id                time     amount
 1:  1 2018-01-20 00:00:00 54.9508346
 2:  1 2018-01-20 01:00:00 12.7618139
 3:  1 2018-01-20 02:00:00 34.5093891
 4:  1 2018-01-20 03:00:00  0.8615881
 5:  2 2018-01-20 00:00:00 30.5557992
 6:  2 2018-01-20 01:00:00 75.5356406
 7:  2 2018-01-20 02:00:00 69.0175725
 8:  3 2018-01-20 00:00:00 81.0468155
 9:  3 2018-01-20 01:00:00 90.5459248
10:  3 2018-01-20 02:00:00 36.6017705
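The refined approach relies on r := rowid(id) to tag each original row with a running counter within its id, so that the later grouping by r can keep, for each original observation, only the interval it is nearest to. A small sketch of what rowid returns, using a made-up vector ids rather than the real data:

```r
library(data.table)

# Hypothetical id vector, interleaved like the sample data
ids <- c(1, 2, 3, 1, 2, 3, 1)

# rowid() numbers the occurrences of each distinct value in order of appearance
r <- rowid(ids)
r
# [1] 1 1 1 2 2 2 3
```

Within one id, each value of r therefore identifies exactly one source row, which is why it can serve as the grouping key when removing duplicate assignments.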
Answer 2:
Something like this using lubridate?
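library(lubridate)
library(dplyr)
# parse the timestamps, round each to the nearest hour, and record the distance
test$time <- ymd_hms(test$time)
test$HTime <- round_date(test$time, unit = "hour")
test$DiffTime <- abs(test$time - test$HTime)
# per id and rounded hour, keep the amount with the smallest distance
result <- test %>%
  group_by(id, HTime) %>%
  summarize(amount = amount[DiffTime == min(DiffTime)])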
result
# A tibble: 8 x 3
# Groups: id [?]
id HTime amount
<dbl> <dttm> <dbl>
1 1.00 2018-01-20 00:00:00 55.0
2 1.00 2018-01-20 01:00:00 12.8
3 1.00 2018-01-20 03:00:00 0.862
4 2.00 2018-01-20 01:00:00 75.5
5 2.00 2018-01-20 02:00:00 69.0
6 3.00 2018-01-20 00:00:00 81.0
7 3.00 2018-01-20 01:00:00 90.5
8 3.00 2018-01-20 02:00:00 36.6
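One thing to watch with amount[DiffTime == min(DiffTime)] is that it yields more than one value if two rows in a group are equally close to the hour. A possible variant, sketched here on a made-up tibble toy and assuming dplyr 1.0.0 or later for slice_min, keeps exactly one row per group:

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data modelled on the sample (UTC chosen only for the sketch)
toy <- tibble(
  time = ymd_hms(c("2018-01-20 00:02:14", "2018-01-20 00:03:35",
                   "2018-01-20 00:47:45"), tz = "UTC"),
  id = c(1, 1, 1),
  amount = c(54.95, 79.76, 12.76)
)

res <- toy %>%
  mutate(HTime = round_date(time, unit = "hour"),
         # distance to the rounded hour, in seconds
         DiffTime = abs(as.numeric(difftime(time, HTime, units = "secs")))) %>%
  group_by(id, HTime) %>%
  # keep the single closest row per id and hour; ties are broken by row order
  slice_min(DiffTime, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(id, HTime, amount)
```

With with_ties = FALSE the first of any tied rows is kept, which may or may not be the tie-breaking rule you want.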
Answer 3:
Inspired by @DavidAurenburg's solution, a condensed version:
test[,
.(amount=amount[which.min(abs(time - round(time, "hour")))]),
keyby=.(id, as.character(round(time, "hour")))]
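The as.character() call is there because round() on a POSIXct value returns POSIXlt, which cannot be used directly as a grouping column. If a datetime column is preferred in the result, one option is to convert back with as.POSIXct() first. A minimal sketch on made-up data (toy, interval and res are illustrative names, and UTC is assumed only to keep the example self-contained):

```r
library(data.table)

# Hypothetical toy data modelled on the sample
toy <- data.table(
  time = as.POSIXct(c("2018-01-20 00:02:14", "2018-01-20 00:03:35",
                      "2018-01-20 00:47:45", "2018-01-20 00:54:41"), tz = "UTC"),
  id = c(1, 1, 1, 2),
  amount = c(54.95, 79.76, 12.76, 30.56)
)

# round() returns POSIXlt, so convert back to POSIXct before storing it
toy[, interval := as.POSIXct(round(time, "hours"))]

# per id and interval, keep the amount whose timestamp is closest to the interval
res <- toy[, .(amount = amount[which.min(abs(as.numeric(time) - as.numeric(interval)))]),
           keyby = .(id, interval)]
```

Here res has one row per id and rounded hour, with interval kept as POSIXct.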
The previous post below didn't match the user's required output.
Maybe you would like to include id in your joins as well. When using roll = "nearest", you might get matches from data recorded a few hours away.
output <- test[intervals, on=c("id","time"), roll="nearest"]
setorder(output, id, time)
output
# time id amount
# 1: 2018-01-20 00:00:00 1 54.9508346
# 2: 2018-01-20 01:00:00 1 12.7618139
# 3: 2018-01-20 02:00:00 1 34.5093891
# 4: 2018-01-20 03:00:00 1 0.8615881
# 5: 2018-01-20 00:00:00 2 30.5557992
# 6: 2018-01-20 01:00:00 2 75.5356406
# 7: 2018-01-20 02:00:00 2 69.0175725
# 8: 2018-01-20 03:00:00 2 69.0175725
# 9: 2018-01-20 00:00:00 3 81.0468155
# 10: 2018-01-20 01:00:00 3 90.5459248
# 11: 2018-01-20 02:00:00 3 36.6017705
# 12: 2018-01-20 03:00:00 3 36.6017705
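Rows 8 and 12 above show the caveat: the 03:00 slots for ids 2 and 3 are filled from observations made around 02:25. A rough sketch of one way to drop such distant matches, using made-up data (obs, grid) and an assumed 30-minute cutoff that is not part of the original question:

```r
library(data.table)

# Hypothetical toy data: a single observation at 00:10
obs <- data.table(
  id = 1L,
  time = as.POSIXct("2018-01-20 00:10:00", tz = "UTC"),
  amount = 5
)
# Hourly grid for the same id
grid <- data.table(
  id = 1L,
  time = seq(as.POSIXct("2018-01-20 00:00:00", tz = "UTC"),
             as.POSIXct("2018-01-20 03:00:00", tz = "UTC"), by = "hour")
)

# Keep a copy of the observation time; after the join, 'time' holds the grid values
obs[, obs_time := time]

# Rolling join: every grid row gets the nearest observation, however far away
joined <- obs[grid, on = .(id, time), roll = "nearest"]

# Distance in minutes between the grid time and the matched observation
joined[, gap_min := abs(as.numeric(difftime(time, obs_time, units = "mins")))]

# Assumed cutoff: keep only matches within 30 minutes of the grid time
close_only <- joined[gap_min <= 30]
```

In this toy case joined has four rows, all carrying the same observation, while close_only keeps only the 00:00 row.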
Hope to see a more elegant use of data.table to solve this.
data:
intervals <- CJ(time=seq(as.POSIXct("2018-01-20 00:00:00"),
as.POSIXct("2018-01-20 03:00:00"),
by="hour"), id=1:3)
test <- fread("time,id,amount
2018-01-20 00:02:14 PST,1,54.9508346
2018-01-20 00:54:41 PST,2,30.5557992
2018-01-20 01:19:27 PST,3,90.5459248
2018-01-20 00:03:35 PST,1,79.7635973
2018-01-20 00:56:37 PST,2,75.5356406
2018-01-20 02:25:05 PST,3,36.6017705
2018-01-20 00:47:45 PST,1,12.7618139
2018-01-20 01:15:30 PST,2,72.4465838
2018-01-20 00:08:01 PST,3,81.0468155
2018-01-20 03:04:10 PST,1,0.8615881
2018-01-20 01:25:04 PST,2,49.8718743
2018-01-20 01:29:30 PST,3,85.0054113
2018-01-20 01:23:22 PST,1,34.5093891
2018-01-20 02:25:40 PST,2,69.0175725
2018-01-20 01:31:32 PST,3,61.8602426")[,
time:=as.POSIXct(time)]
Source: https://stackoverflow.com/questions/48457575/match-data-to-nearest-time-value-by-id