Question
Suppose that I have two sets of identifiers, id1 and id2, in a data frame. How can I create a new identifier id3 that works as follows? I consider id1 the stricter key, so observations are grouped first by id1 and then by id2. If two sets of rows with different values of id2 share an id1 value in some of their rows, the two sets should get the same value of id3 (the exact value of id3 doesn't matter much).
df <- data.frame(id1 = c(1, 1, 2, 2, 5, 6),
                 id2 = c(4, 3, 1, 2, 2, 7),
                 id3 = c(1, 1, 2, 2, 2, 3))
Rows 1 and 2 are grouped together because they have the same id1. Rows 3, 4 and 5 are grouped together because rows 3 and 4 have the same id1 and rows 4 and 5 have the same id2.
Can someone help? I would prefer a dplyr solution that covers the general case, with an arbitrary number of possible values in the id columns.
Answer 1:
This is a graph theory problem. Each id1 value and each id2 value is a separate node, and df gives the links between them. You are looking for the connected component that each id belongs to.
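library(dplyr)
library(igraph)
# Prefix the values so that id1 and id2 nodes with the same number stay distinct
df <- df %>% mutate(from = paste0('id1', '_', id1), to = paste0('id2', '_', id2))
dg <- graph_from_data_frame(df %>% select(from, to), directed = FALSE)
# membership is a vector named by node, so it can be indexed by the from column
df <- df %>% mutate(id3 = components(dg)$membership[from])
df %>% select(id1, id2, id3)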
#>   id1 id2 id3
#> 1   1   4   1
#> 2   1   3   1
#> 3   2   1   2
#> 4   2   2   2
#> 5   5   2   2
#> 6   6   7   3
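If the helper columns from and to are not wanted in the data frame, the same idea can be wrapped in a small function and called inside mutate. The following is only a sketch of that variant: the name merge_ids is hypothetical and does not come from the answer, and it assumes the two vectors have equal length and contain no missing values.

```r
library(dplyr)
library(igraph)

# Hypothetical helper (not part of the original answer): returns one
# component id per row for two identifier vectors of equal length.
merge_ids <- function(id1, id2) {
  # Prefix the values so that id1 == 1 and id2 == 1 become distinct nodes.
  from <- paste0("id1_", id1)
  to   <- paste0("id2_", id2)
  g <- graph_from_data_frame(data.frame(from, to), directed = FALSE)
  # membership is named by node, so indexing by `from` gives one id per row.
  unname(components(g)$membership[from])
}

df <- data.frame(id1 = c(1, 1, 2, 2, 5, 6),
                 id2 = c(4, 3, 1, 2, 2, 7))

df %>% mutate(id3 = merge_ids(id1, id2))
```

Because paste0 turns NA into the text "NA", rows with a missing identifier would all be linked to one another, so missing values probably need to be handled before calling such a helper.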
Source: https://stackoverflow.com/questions/63908856/how-to-merge-two-different-groupings-if-they-are-not-disjoint-with-dplyr