R – How to join two data frames by nearest time-date?

前端未结

关注

 2  1518

栀梦 2020-12-03 06:30

I have 2 data sets, each containing a date-time value in POSIXlt format, and some other numeric and character variables.

I want to combine both data sets based on th

2条回答

难免孤独 (楼主)

2020-12-03 06:43

I had a similar problem, but instead of using data.table or tidyverse I created my own function amerge for "approximate merge". It takes 4 arguments:

two data frames,
a vector of column names for "firm" (not approximate) merge - these must exist in both data frames,
and the name of a single column (in both data frames) for approximate merge. It will work for any numeric values, including dates.

The idea was to merge rows 1-to-1 of best matches, and not loose any rows from any data frame. Here is my commented code with a working example.

amerge <- function(d1, d2, firm=NULL, approx=NULL) {
  rt = Sys.time()

  # Take care of conflicting column names
  n2 = data.frame(oldname = names(d2), newname = names(d2))
  n2$newname = as.character(n2$newname)
  n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)] =
    paste(n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)], "2", sep=".")

  # Add unique row IDs
  if (length(c(firm, approx))>1) {
    d1$ID1 = factor(apply(d1[,c(approx,firm)], 1, paste, collapse=" "))
    d2$ID2 = factor(apply(d2[,c(approx,firm)], 1, paste, collapse=" "))
  } else {
    d1$ID1 = factor(d1[,c(approx,firm)])
    d2$ID2 = factor(d2[,c(approx,firm)])
  }

  # Perform initial merge on the 'firm' parameters, if any
  # Otherwise match all to all
  if (length(firm)>0) {
    t1 = merge(d1, d2, by=firm, all=T, suff=c("",".2"))
  } else {
    names(d2)= c(n2$newname,"ID2")
    t1 = data.frame()
    for (i1 in 1:nrow(d1)) {
      trow = d1[i1,]
      t1 = rbind(t1, cbind(trow, d2))
    }
  }

  # Match by the most approximate record
  if (length(approx)==1) {
    # Calculate the differential for approximate merging
    t1$DIFF = abs(t1[,approx] - t1[,n2$newname[n2$oldname==approx]])
    # Sort data by ascending DIFF, so that best matching records are used first
    t1 = t1[order(t1$DIFF, t1$ID1, t1$ID2),]
    t2 = data.frame()
    d2$used = 0
    # For each record of d1, find match from d2
    for (i1 in na.omit(unique(t1$ID1))) {
      tx = t1[!is.na(t1$DIFF) & t1$ID1==i1,]
      # If there are non-missing records, get the one with minimum DIFF (top one)
      if (nrow(tx)>0) {
        tx = tx[1,]
        # If matching record found, remove it from the pool, so it's not used again
        t1[!is.na(t1$ID2) & t1$ID2==tx$ID2, c(n2$newname[!(n2$newname %in% firm)], "DIFF")] = NA
        # And mark it as used
        d2$used[d2$ID2==tx$ID2] = 1
      } else {
        # If there are no non-missing records, just get the first one from the top
        tx = t1[!is.na(t1$ID1) & t1$ID1==i1,][1,]
      }
      t2 = rbind(t2,tx)
    }
  } else {
    t2 = t1
  }
  # Make the records the same order as d1
  t2 = t2[match(d1$ID1, t2$ID1),]
  # Add unmatched records from d2 to the end of output
  if (any(d2$used==0)) {
    tx = t1[t1$ID2 %in% d2$ID2[d2$used==0], ]
    tx = tx[!duplicated(tx$ID2),]
    tx[, names(d1)[!(names(d1) %in% c(firm))]] = NA
    t2 = rbind(t2,tx)
    t2[is.na(t2[,approx]), approx] = t2[is.na(t2[,approx]), n2$newname[n2$oldname==approx]]
  }
  t2$DIFF = t2$ID1 = t2$ID2 = NULL
  cat("* Run time: ", round(difftime(Sys.time(),rt, "secs"),1), " seconds.\n", sep="")
  return(t2)
}

And the example:

new <- data.frame(ID=c(1,1,1,2), date = as.POSIXct( c("2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-04-12 11:03:00")), new = c("t","u","v","x"))
old <- data.frame(ID=c(1,1,1,1,1), date = as.POSIXct( c("2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-03-01 10:09:00", "2015-04-12 10:09:00","2016-03-03 12:20:00")), old = c("a","b","c","d","e"))

amerge(old, new, firm="ID", approx="date")

It outputs:

   ID                date  old              date.2  new
2   1 2016-03-07 12:20:00    a 2016-03-07 12:20:00    u
6   1 2016-04-02 12:20:00    b 2016-04-02 12:20:00    v
7   1 2016-03-01 10:09:00    c                 
10  1 2015-04-12 10:09:00    d                 
13  1 2016-03-03 12:20:00    e 2016-03-02 12:20:00    t
16  2 2016-04-12 11:03:00  2016-04-12 11:03:00    x

So works for my purpose as intended - there is exactly one copy of each row from both data frames - matched by shortest time difference. One note: the function copies date.2 into date column where the date would be missing.

0 讨论(0)

查看其它2个回答