Merge Records Over Time Interval

Posted by 我的未来我决定 on 2019-11-28 10:30:46
G. Grothendieck

Set up data

First, set up the input data frames. We create two versions: A and B use character columns for the times, while At and Bt use the "times" class from the chron package. Unlike "character" values, "times" values can be added and subtracted:

LinesA <- "OBS ID StartTime Duration Outcome 
    1   01 10:12:06  00:00:10 Normal
    2   02 10:12:30  00:00:30 Weird
    3   01 10:15:12  00:01:15 Normal
    4   02 10:45:00  00:00:02 Normal"

LinesB <- "OBS ID Time       
    1   01 10:12:10  
    2   01 10:12:17  
    3   02 10:12:45  
    4   01 10:13:00"

A <- At <- read.table(textConnection(LinesA), header = TRUE, 
               colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE, 
               colClasses = c("numeric", rep("character", 2)))

# in At and Bt convert times columns to "times" class

library(chron) 

At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
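
To illustrate why the "times" class is convenient here, the following minimal sketch adds a duration to a start time and checks whether two observation times fall inside the resulting interval. It assumes the chron package is installed. The variable names (start, duration, end, obs, inside) are made up for illustration, and the values are copied from the sample data above.

```r
library(chron)

# Illustrative values from the first row of A (variable names are hypothetical).
start    <- times("10:12:06")
duration <- times("00:00:10")

# "times" objects can be added, which gives the end of the interval.
end <- start + duration

# Comparisons also work directly on "times" objects.
obs    <- times(c("10:12:10", "10:12:17"))
inside <- obs >= start & obs <= end
print(inside)
```

The first observation lies within ten seconds of the start and the second does not, so only the first element of inside is TRUE.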

sqldf with times class

Now we can perform the calculation using the sqldf package. We use method="raw" (which does not assign classes to the output), so we must assign the "times" class to the output "Time" column ourselves:

library(sqldf)

out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
   where Time between StartTime and StartTime + Duration",
   method = "raw")

out$Time <- times(as.numeric(out$Time))

The result is:

> out
  OBS ID     Time Outcome
1   1 01 10:12:10  Normal
2   3 02 10:12:45   Weird
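
The re-classing step works because a "times" value is stored as a plain number, namely a fraction of a day. The sketch below shows that round trip on its own, without sqldf. It assumes the chron package is installed, and the variable names are chosen only for illustration.

```r
library(chron)

x <- times("10:12:10")

# Dropping the class leaves a plain number: the fraction of a day elapsed.
raw <- as.numeric(x)

# Applying times() to that number restores the "times" class.
restored <- times(raw)

print(raw)
print(restored)
```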

With the development version of sqldf this can be done without using method="raw" and the "Time" column will automatically be set to "times" class by the sqldf class assignment heuristic:

library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver 
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
    where Time between StartTime and StartTime + Duration")

sqldf with character class

It's actually possible to avoid the "times" class altogether by performing all time calculations in SQLite directly on the character strings, using SQLite's strftime function. Unfortunately, the SQL statement is a bit more involved:

sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
    where strftime('%s', Time) - strftime('%s', StartTime)
       between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
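
The arithmetic behind that query can be sketched in base R as a sanity check. This is not part of the sqldf solution: to_seconds is a hypothetical helper that turns an "HH:MM:SS" string into seconds since midnight, and the values come from the second row of A and the third row of B.

```r
# Hypothetical helper: convert "HH:MM:SS" strings to seconds since midnight.
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

start    <- to_seconds("10:12:30")
duration <- to_seconds("00:00:30")
time     <- to_seconds("10:12:45")

# Same test as the SQL: elapsed seconds must lie between 0 and the duration.
elapsed  <- time - start
in_range <- elapsed >= 0 & elapsed <= duration
print(c(elapsed = elapsed, in_range = in_range))
```

Here the observation falls 15 seconds after the start of a 30-second interval, so it is in range.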

EDIT:

A series of edits which fixed grammar, added additional approaches and fixed/improved the read.table statements.

EDIT:

Simplified/improved final sqldf statement.

Here is an example:

# first, merge by ID
z <- merge(A[, -1], B, by = "ID")

# convert string to POSIX time
z <- transform(z,
  s_t = as.numeric(strptime(as.character(z$StartTime), "%H:%M:%S")),
  dur = as.numeric(strptime(as.character(z$Duration), "%H:%M:%S")) - 
    as.numeric(strptime("00:00:00", "%H:%M:%S")),
  tim = as.numeric(strptime(as.character(z$Time), "%H:%M:%S")))

# subset by time range
subset(z, s_t < tim & tim < s_t + dur)

The output:

  ID StartTime Duration Outcome OBS     Time        s_t dur        tim
1  1  10:12:06 00:00:10  Normal   1 10:12:10 1321665126  10 1321665130
2  1  10:12:06 00:00:10  Normal   2 10:12:15 1321665126  10 1321665135
7  2  10:12:30 00:00:30   Weird   3 10:12:45 1321665150  30 1321665165

OBS #2 looks to be in range. Does that make sense?

Merge the two data frames with merge(), then subset() the result with the condition Time >= StartTime & Time <= StartTime + Duration, or whatever rule makes sense for your data.
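
A minimal base R sketch of this merge-then-subset idea follows, assuming inclusive bounds on both ends. The secs helper and the columns s, d and tm are hypothetical names introduced for illustration, and the data frames are rebuilt inline from the sample data so the block runs on its own.

```r
A <- data.frame(
  OBS = 1:4,
  ID = c("01", "02", "01", "02"),
  StartTime = c("10:12:06", "10:12:30", "10:15:12", "10:45:00"),
  Duration = c("00:00:10", "00:00:30", "00:01:15", "00:00:02"),
  Outcome = c("Normal", "Weird", "Normal", "Normal"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  OBS = 1:4,
  ID = c("01", "01", "02", "01"),
  Time = c("10:12:10", "10:12:17", "10:12:45", "10:13:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "HH:MM:SS" -> seconds since midnight.
secs <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Pair every B row with every A row that shares its ID (A's OBS is dropped).
m <- merge(A[, -1], B, by = "ID")
m <- transform(m, s = secs(StartTime), d = secs(Duration), tm = secs(Time))

# Keep the pairs whose Time lies inside [StartTime, StartTime + Duration].
res <- subset(m, tm >= s & tm <= s + d, select = c(OBS, ID, Time, Outcome))
res <- res[order(res$OBS), ]
print(res)
```

With this sample data and inclusive bounds, only OBS 1 and OBS 3 of B fall inside an interval.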
