Fast way to group variables based on direct and indirect similarities in multiple columns

后端未结

关注

 2  1780

忘了有多久 2020-12-31 08:23

I have a relatively large data set (1,750,000 lines, 5 columns) which contains records with unique ID values (first column), described by four criteria (4 other columns). A

2条回答

抹茶落季 (楼主)

2020-12-31 09:20

I think this recursive approach does what you want. Basically, it performs a self-join on each column, one at a time, and if more than one row is matched (i.e. rows other than the row being considered), it saves all unique ids from the match. It avoids using the rows with NA by leveraging secondary indices. The trick is that we do the recursion twice, once with ids, and again but with the newly created new_ids.

dt[, new_id := .(list(character()))]

get_ids <- function(matched_ids, new_id) {
  if (length(matched_ids) > 1L) {
    list(unique(
      c(new_id[[1L]], unlist(matched_ids))
    ))
  } else {
    new_id
  }
}

find_recursively <- function(dt, cols, pass) {
  if (length(cols) == 0L) return(invisible())

  current <- cols[1L]
  next_cols <- cols[-1L]

  next_dt <- switch(
    pass,

    first = dt[!list(NA_character_),
               new_id := dt[.SD, .(get_ids(x.id, i.new_id)), on = current, by = .EACHI]$V1,
               on = current],

    second = dt[!list(NA_character_),
                new_id := dt[.SD, .(get_ids(x.new_id, i.new_id)), on = current, by = .EACHI]$V1,
                on = current]
  )

  find_recursively(next_dt, next_cols, pass)
}

find_recursively(dt, paste0("s", 1:4), "first")
find_recursively(dt, paste0("s", 1:4), "second")

dt[, new_id := sapply(new_id, function(nid) {
  ids <- unlist(nid)
  if (length(ids) == 0L) {
    NA_character_
  } else {
    paste(ids, collapse = "|")
  }
})]

print(dt)
    id   s1   s2 s3   s4   new_id
 1: a1    a    d  f    h a1|b3|c7
 2: b3    b    d  g    i a1|b3|c7
 3: c7    c    e  f    j a1|c7|b3
 4: d5    l    k  l    m    d5|e3
 5: e3    l    k  l    m    d5|e3
 6: f4    o    o  s    o f4|g2|h1
 7: g2    o    o  r    o f4|g2|h1
 8: h1    o    o  u    o f4|g2|h1
 9: i9    w      
10: j6    z

The join uses this idiom.

0 讨论(0)

查看其它2个回答