Match data to nearest time value by id

Question

I have generated a series of hourly time stamps with:

intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = 'America/Los_Angeles'), as.POSIXct("2018-01-20 03:00:00", tz = 'America/Los_Angeles'), by="hour")

> intervals
[1] "2018-01-20 00:00:00 PST" "2018-01-20 01:00:00 PST" "2018-01-20 02:00:00 PST"
[4] "2018-01-20 03:00:00 PST"

Given a dataset with messy and unevenly spaced timestamps, how would one match time values from that dataset to the closest hourly timestamp by id, and remove other timestamps in between? For example:

> test
                         time      id     amount
312   2018-01-20 00:02:14 PST       1 54.9508346
8652  2018-01-20 00:54:41 PST       2 30.5557992
13809 2018-01-20 01:19:27 PST       3 90.5459248
586   2018-01-20 00:03:35 PST       1 79.7635973
9077  2018-01-20 00:56:37 PST       2 75.5356406
21546 2018-01-20 02:25:05 PST       3 36.6017705
7275  2018-01-20 00:47:45 PST       1 12.7618139
12768 2018-01-20 01:15:30 PST       2 72.4465838
1172  2018-01-20 00:08:01 PST       3 81.0468155
24106 2018-01-20 03:04:10 PST       1  0.8615881
14464 2018-01-20 01:25:04 PST       2 49.8718743
15344 2018-01-20 01:29:30 PST       3 85.0054113
14255 2018-01-20 01:23:22 PST       1 34.5093891
21565 2018-01-20 02:25:40 PST       2 69.0175725
15602 2018-01-20 01:31:32 PST       3 61.8602426

Would produce:

> output
             interval id     amount
1 2018-01-20 01:00:00  1 12.7618139
2          2018-01-20  1 54.9508346
3 2018-01-20 03:00:00  1  0.8615881
4 2018-01-20 01:00:00  2 75.5356400
5 2018-01-20 02:00:00  2 69.0175700
6          2018-01-20  3 81.0468200
7 2018-01-20 01:00:00  3 90.5459200
8 2018-01-20 02:00:00  3 36.6017700

I understand that data.table offers a possible solution using roll = "nearest":

setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]

But how would one find the nearest match in intervals for every id in test and retain the amount column?

Any suggestions would be appreciated! Here is the sample data:

 dput(test)
structure(list(time = c("2018-01-20 00:02:14 PST", "2018-01-20 00:54:41 PST", 
"2018-01-20 01:19:27 PST", "2018-01-20 00:03:35 PST", "2018-01-20 00:56:37 PST", 
"2018-01-20 02:25:05 PST", "2018-01-20 00:47:45 PST", "2018-01-20 01:15:30 PST", 
"2018-01-20 00:08:01 PST", "2018-01-20 03:04:10 PST", "2018-01-20 01:25:04 PST", 
"2018-01-20 01:29:30 PST", "2018-01-20 01:23:22 PST", "2018-01-20 02:25:40 PST", 
"2018-01-20 01:31:32 PST"), id = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 
1, 2, 3, 1, 2, 3), amount = c(54.9508346011862, 30.5557992309332, 
90.5459248460829, 79.763597343117, 75.5356406327337, 36.6017704829574, 
12.7618139144033, 72.4465838400647, 81.0468154959381, 0.861588073894382, 
49.8718742514029, 85.0054113194346, 34.5093891490251, 69.0175724914297, 
61.8602426256984)), .Names = c("time", "id", "amount"), row.names = c(312L, 
8652L, 13809L, 586L, 9077L, 21546L, 7275L, 12768L, 1172L, 24106L, 
14464L, 15344L, 14255L, 21565L, 15602L), class = "data.frame")

Answer 1:

Another option is to join inside j with data.table:

# convert 'test' to a 'data.table' first with 'setDT'
# and convert the 'time' column to a datetime format
setDT(test)[, time := as.POSIXct(time)][]

# perform the join
test[, .SD[.(time = intervals), on = .(time), roll = 'nearest'], by = id]

which gives:

    id                time     amount
 1:  1 2018-01-20 00:00:00 54.9508346
 2:  1 2018-01-20 01:00:00 12.7618139
 3:  1 2018-01-20 02:00:00 34.5093891
 4:  1 2018-01-20 03:00:00  0.8615881
 5:  2 2018-01-20 00:00:00 30.5557992
 6:  2 2018-01-20 01:00:00 75.5356406
 7:  2 2018-01-20 02:00:00 69.0175725
 8:  2 2018-01-20 03:00:00 69.0175725
 9:  3 2018-01-20 00:00:00 81.0468155
10:  3 2018-01-20 01:00:00 90.5459248
11:  3 2018-01-20 02:00:00 36.6017705
12:  3 2018-01-20 03:00:00 36.6017705

In the above approach, some amount values are assigned to more than one time within an id. If you don't want that, and only want to keep each value at the time it is closest to, you could refine the approach as follows:

test[, r := rowid(id)
     ][, .SD[.(time = intervals)
             , on = .(time)
             , roll = 'nearest'
             , .(time, amount, r, time_diff = abs(x.time - i.time))
             ][, .SD[which.min(time_diff)], by = r]
       , by = id][, c('r','time_diff') := NULL][]

which gives:

    id                time     amount
 1:  1 2018-01-20 00:00:00 54.9508346
 2:  1 2018-01-20 01:00:00 12.7618139
 3:  1 2018-01-20 02:00:00 34.5093891
 4:  1 2018-01-20 03:00:00  0.8615881
 5:  2 2018-01-20 00:00:00 30.5557992
 6:  2 2018-01-20 01:00:00 75.5356406
 7:  2 2018-01-20 02:00:00 69.0175725
 8:  3 2018-01-20 00:00:00 81.0468155
 9:  3 2018-01-20 01:00:00 90.5459248
10:  3 2018-01-20 02:00:00 36.6017705
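
To see why the plain rolling join repeats values, here is a minimal sketch on made-up numeric "times" (the names obs and grid and all numbers are hypothetical, not taken from the question). With roll = "nearest", every row of the lookup table gets the closest observation, so an observation can be picked more than once:

```r
library(data.table)

# Hypothetical observations at uneven times (plain numbers for simplicity)
obs  <- data.table(time = c(2, 55, 140), amount = c(10, 20, 30))
# Hypothetical regular grid to match against
grid <- data.table(time = c(0, 60, 120, 180))

# For every grid time, take the observation whose time is nearest
res <- obs[grid, on = .(time), roll = "nearest"]
res$amount
# 10 20 30 30  (the observation at time 140 serves both 120 and 180)
```

This is the duplication that the which.min(time_diff) step above is meant to remove.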

Answer 2:

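Something like this, using lubridate and dplyr?

library(lubridate)
library(dplyr)
test$time <- ymd_hms(test$time)
# round each timestamp to the nearest full hour and measure the distance to it
test$HTime <- round_date(test$time, unit = "hour")
test$DiffTime <- abs(test$time - test$HTime)
# per id and hour, keep the amount of the closest observation
result <- test %>%
  group_by(id, HTime) %>%
  summarize(amount = amount[DiffTime == min(DiffTime)])
result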

# A tibble: 8 x 3
# Groups: id [?]
     id HTime               amount
  <dbl> <dttm>               <dbl>
1  1.00 2018-01-20 00:00:00 55.0  
2  1.00 2018-01-20 01:00:00 12.8  
3  1.00 2018-01-20 03:00:00  0.862
4  2.00 2018-01-20 01:00:00 75.5  
5  2.00 2018-01-20 02:00:00 69.0  
6  3.00 2018-01-20 00:00:00 81.0  
7  3.00 2018-01-20 01:00:00 90.5  
8  3.00 2018-01-20 02:00:00 36.6
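
The same round-then-pick-the-closest idea can also be sketched in base R, without extra packages. This is only an illustration under stated assumptions: the toy data frame and its values are made up, and the times are taken to be UTC so that full hours fall on multiples of 3600 seconds since the epoch.

```r
# Hypothetical toy data (not the asker's data set)
toy <- data.frame(
  time = as.POSIXct(c("2018-01-20 00:02:14", "2018-01-20 00:03:35",
                      "2018-01-20 00:47:45", "2018-01-20 01:23:22"),
                    tz = "UTC"),
  id = c(1, 1, 1, 1),
  amount = c(10, 20, 30, 40)
)

# Nearest full hour, computed on seconds since the epoch
secs <- as.numeric(toy$time)
hour_secs <- round(secs / 3600) * 3600
toy$interval <- as.POSIXct(hour_secs, origin = "1970-01-01", tz = "UTC")
toy$gap <- abs(secs - hour_secs)

# Sort so the smallest gap comes first within each id/interval pair,
# then keep only the first row of every pair
toy <- toy[order(toy$id, toy$interval, toy$gap), ]
key <- paste(toy$id, as.numeric(toy$interval))
nearest <- toy[!duplicated(key), c("interval", "id", "amount")]
nearest
```

Here the rows at 00:02:14 and 00:47:45 survive, because they are the closest observations to 00:00 and 01:00 respectively.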

Answer 3:

Inspired by @DavidAurenburg's solution, here is a condensed version:

test[, 
    .(amount=amount[which.min(abs(time - round(time, "hour")))]), 
    keyby=.(id, as.character(round(time, "hour")))]

The earlier version of this answer, kept below, didn't match the output the user required.

Maybe you would like to include id in the join as well. With roll = "nearest", an interval with no close observation can still be matched to data from a few hours earlier.

output <- test[intervals, on=c("id","time"), roll="nearest"]
setorder(output, id, time)
output
#                    time id     amount
#  1: 2018-01-20 00:00:00  1 54.9508346
#  2: 2018-01-20 01:00:00  1 12.7618139
#  3: 2018-01-20 02:00:00  1 34.5093891
#  4: 2018-01-20 03:00:00  1  0.8615881
#  5: 2018-01-20 00:00:00  2 30.5557992
#  6: 2018-01-20 01:00:00  2 75.5356406
#  7: 2018-01-20 02:00:00  2 69.0175725
#  8: 2018-01-20 03:00:00  2 69.0175725
#  9: 2018-01-20 00:00:00  3 81.0468155
# 10: 2018-01-20 01:00:00  3 90.5459248
# 11: 2018-01-20 02:00:00  3 36.6017705
# 12: 2018-01-20 03:00:00  3 36.6017705

I hope to see a more elegant data.table solution to this.

data:

intervals <- CJ(time=seq(as.POSIXct("2018-01-20 00:00:00"), 
    as.POSIXct("2018-01-20 03:00:00"), 
    by="hour"), id=1:3)

test <- fread("time,id,amount
2018-01-20 00:02:14 PST,1,54.9508346
2018-01-20 00:54:41 PST,2,30.5557992
2018-01-20 01:19:27 PST,3,90.5459248
2018-01-20 00:03:35 PST,1,79.7635973
2018-01-20 00:56:37 PST,2,75.5356406
2018-01-20 02:25:05 PST,3,36.6017705
2018-01-20 00:47:45 PST,1,12.7618139
2018-01-20 01:15:30 PST,2,72.4465838
2018-01-20 00:08:01 PST,3,81.0468155
2018-01-20 03:04:10 PST,1,0.8615881
2018-01-20 01:25:04 PST,2,49.8718743
2018-01-20 01:29:30 PST,3,85.0054113
2018-01-20 01:23:22 PST,1,34.5093891
2018-01-20 02:25:40 PST,2,69.0175725
2018-01-20 01:31:32 PST,3,61.8602426")[,
    time:=as.POSIXct(time)]

Source: https://stackoverflow.com/questions/48457575/match-data-to-nearest-time-value-by-id

Tags

dplyr

data.table