Fast way to group variables based on direct and indirect similarities in multiple columns

后端未结

关注

 2  1775

I have a relatively large data set (1,750,000 lines, 5 columns) which contains records with unique ID values (first column), described by four criteria (4 other columns). A

相关标签:

2条回答

抹茶落季

2020-12-31 09:20

I think this recursive approach does what you want. Basically, it performs a self-join on each column, one at a time, and if more than one row is matched (i.e. rows other than the row being considered), it saves all unique ids from the match. It avoids using the rows with NA by leveraging secondary indices. The trick is that we do the recursion twice, once with ids, and again but with the newly created new_ids.

dt[, new_id := .(list(character()))]

get_ids <- function(matched_ids, new_id) {
  if (length(matched_ids) > 1L) {
    list(unique(
      c(new_id[[1L]], unlist(matched_ids))
    ))
  } else {
    new_id
  }
}

find_recursively <- function(dt, cols, pass) {
  if (length(cols) == 0L) return(invisible())

  current <- cols[1L]
  next_cols <- cols[-1L]

  next_dt <- switch(
    pass,

    first = dt[!list(NA_character_),
               new_id := dt[.SD, .(get_ids(x.id, i.new_id)), on = current, by = .EACHI]$V1,
               on = current],

    second = dt[!list(NA_character_),
                new_id := dt[.SD, .(get_ids(x.new_id, i.new_id)), on = current, by = .EACHI]$V1,
                on = current]
  )

  find_recursively(next_dt, next_cols, pass)
}

find_recursively(dt, paste0("s", 1:4), "first")
find_recursively(dt, paste0("s", 1:4), "second")

dt[, new_id := sapply(new_id, function(nid) {
  ids <- unlist(nid)
  if (length(ids) == 0L) {
    NA_character_
  } else {
    paste(ids, collapse = "|")
  }
})]

print(dt)
    id   s1   s2 s3   s4   new_id
 1: a1    a    d  f    h a1|b3|c7
 2: b3    b    d  g    i a1|b3|c7
 3: c7    c    e  f    j a1|c7|b3
 4: d5    l    k  l    m    d5|e3
 5: e3    l    k  l    m    d5|e3
 6: f4    o    o  s    o f4|g2|h1
 7: g2    o    o  r    o f4|g2|h1
 8: h1    o    o  u    o f4|g2|h1
 9: i9 <NA> <NA>  w <NA>     <NA>
10: j6 <NA> <NA>  z <NA>     <NA>

The join uses this idiom.

0 讨论(0)

名媛妹妹

2020-12-31 09:23

You may approach this as a network problem. Here I use functions from the igraph package. The basic steps:

meltthe data to long format.
Use graph_from_data_frame to create a graph, where 'id' and 'value' columns are treated as an edge list.
Use components to get connected components of the graph, i.e. which 'id' are connected via their criteria, directly or indirectly.
Select the membership element to get "the cluster id to which each vertex belongs".
Join membership to original data.
Concatenate 'id' grouped by cluster membership.

library(igraph)

# melt data to long format, remove NA values
d <- melt(dt, id.vars = "id", na.rm = TRUE)

# convert to graph
g <- graph_from_data_frame(d[ , .(id, value)])

# get components and their named membership id 
mem <- components(g)$membership

# add membership id to original data
dt[.(names(mem)), on = .(id), mem := mem] 

# for groups of length one, set 'mem' to NA
dt[dt[, .I[.N == 1], by = mem]$V1, mem := NA]

If desired, concatenate 'id' by 'mem' column (for non-NA 'mem') (IMHO this just makes further data manipulation more difficult ;) ). Anyway, here we go:

dt[!is.na(mem), id2 := paste(id, collapse = "|"), by = mem]

#     id   s1   s2 s3   s4  mem      id2
#  1: a1    a    d  f    h    1 a1|b3|c7
#  2: b3    b    d  g    i    1 a1|b3|c7
#  3: c7    c    e  f    j    1 a1|b3|c7
#  4: d5    l    k  l    m    2    d5|e3
#  5: e3    l    k  l    m    2    d5|e3
#  6: f4    o    o  s    o    3 f4|g2|h1
#  7: g2    o    o  r    o    3 f4|g2|h1
#  8: h1    o    o  u    o    3 f4|g2|h1
#  9: i9 <NA> <NA>  w <NA>   NA     <NA>
# 10: j6 <NA> <NA>  z <NA>   NA     <NA>

A basic plot of the graph in this small example, just to illustrate the connected components:

plot(g, edge.arrow.size = 0.5, edge.arrow.width = 0.8, vertex.label.cex = 2, edge.curved = FALSE)

0 讨论(0)