Question
Suppose I have two identifier columns, id1 and id2, in a data frame. How can I create a new identifier id3 that works as follows?
I consider id1 the stricter key, so observations are grouped first by id1 and then by id2. If two groups of rows with different id2 values share an id1 value in any of their rows, both groups should get the same id3 (the exact value of id3 doesn't matter much). Here is an example, with the desired id3 filled in by hand:
df <- data.frame(id1 = c(1, 1, 2, 2, 5, 6),
id2 = c(4, 3, 1, 2, 2, 7),
id3 = c(1, 1, 2, 2, 2, 3))
Rows 1 and 2 are grouped together because they have the same id1. Rows 3, 4 and 5 are grouped together because 3 and 4 have the same id1 and 4 and 5 have the same id2.
Can someone help? I would prefer a dplyr solution that handles the general case, with an arbitrary number of distinct values in the id columns.
Answer 1:
This is a graph theory problem. Treat each distinct id1 value and each distinct id2 value as a separate node, and each row of df as a link between its id1 node and its id2 node. You are then looking for the connected component that each id belongs to.
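library(dplyr)
library(igraph)
# Prefix each id with its column name so that id1 = 2 and id2 = 2 become distinct nodes
df <- df %>% mutate(from = paste0('id1', '_', id1), to = paste0('id2', '_', id2))
# Each row becomes an undirected edge between its id1 node and its id2 node
dg <- graph_from_data_frame(df %>% select(from, to), directed = FALSE)
# Look up each row's component number by the name of its id1 node
df <- df %>% mutate(id3 = components(dg)$membership[from])
df %>% select(id1, id2, id3)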
#> id1 id2 id3
#> 1 1 4 1
#> 2 1 3 1
#> 3 2 1 2
#> 4 2 2 2
#> 5 5 2 2
#> 6 6 7 3
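Since the question asks for dplyr, here is a minimal sketch of the same connected-components idea without igraph. It is not the answer's method but a swapped-in label-propagation approach: every row starts with its own label, and each pass pulls the label down to the smallest one seen in the row's id1 group and then in its id2 group, until nothing changes. The helper name merge_ids is hypothetical, and the sketch assumes the columns are literally named id1 and id2.

```r
library(dplyr)

df <- data.frame(id1 = c(1, 1, 2, 2, 5, 6),
                 id2 = c(4, 3, 1, 2, 2, 7))

# merge_ids is a hypothetical helper name, not part of dplyr or igraph.
merge_ids <- function(data) {
  # Start with a unique label per row.
  data <- data %>% mutate(id3 = row_number())
  repeat {
    # Rows sharing an id1, then rows sharing an id2, take the smallest label.
    updated <- data %>%
      group_by(id1) %>% mutate(id3 = min(id3)) %>%
      group_by(id2) %>% mutate(id3 = min(id3)) %>%
      ungroup()
    # Stop once a full pass leaves every label unchanged.
    if (identical(updated$id3, data$id3)) break
    data <- updated
  }
  # Renumber labels as 1, 2, 3, ... in order of first appearance.
  updated %>% mutate(id3 = match(id3, unique(id3)))
}

result <- merge_ids(df)
```

On the example data this should reproduce the grouping above. Long chains of linked ids may need many passes, so for large data the igraph approach is probably the better choice.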
Source: https://stackoverflow.com/questions/63908856/how-to-merge-two-different-groupings-if-they-are-not-disjoint-with-dplyr