Find unique set of identifiers/groups among several columns

≯℡__Kan透↙ 提交于 2019-12-11 07:49:11

问题


I have data with two (potentially more) columns of identifiers (typically long strings). These differ sometimes, are mistyped, or change over time. I want to identify unique subjects in the data. This requires identifying groups of cases which are connected via their ids at some level.

An example

df <- data.frame(ida = c("A", "B", "C", "C", "D", "E"),
                 idb = c(1, 1, 3, 4, 4, 7),
                 trueid = c("id1", "id1", "id2", "id2", "id2", "id3"))
> df
  ida idb trueid
1   A   1    id1
2   B   1    id1
3   C   3    id2
4   C   4    id2
5   D   4    id2
6   E   7    id3

Identifiers for id1 are "A", "B", 1, for id2 "C", "D", 3, 4 and for id3 "E", 7.

I don't know trueid but need to find it using the information from columns ida and idb.

The solution needs to scale to millions of observations with tens of thousands of unique ids. I am already using data.table.

Extension: There is another scenario where there are more than two columns, and some columns might be informative for others, i.e. have the same identifiers. I do not know which columns are informative for which. I think type can be disregarded though, all columns are strings or can safely be converted.

Another example for this:

df <- data.frame(ida = c("A", "B", "C", "C", "D", "E"),
                 idb = c("1", "2", "3", "4", "4", "7"),
                 idc = c("1", "1", "2", "3", "4", "5"),
                 idd = c("1", "A", "2", "3", "4", "5"),
                 trueid = c("id1", "id1", "id1", "id1", "id1", "id2"))
> df
  ida idb idc idd trueid
1   A   1   1   1    id1
2   B   2   1   A    id1
3   C   3   2   2    id1
4   C   4   3   3    id1
5   D   4   4   4    id1
6   E   7   5   5    id2

Edit: As a commenter pointed out, this is essentially the clique problem of finding complete subgraphs in a graph. After reading a bit more, I understand this issue can be solved with library(igraph). I am leaving the question open as I would prefer a solution that relies on base, data.table or dplyr. I cannot easily install packages on the server I am using, installing igraph involves dealing with a lot of red tape and delays.

Edit2: For anybody reading this and facing a similar problem: zx8754's answer using igraph is considerably (several orders of magnitude) faster on larger (simulated) data with more groups. If you have the chance to use igraph, do so.


回答1:


Here is a recursive approach using data.table:

#convert into a long format for easier processing
mDT <- melt(DT[, rn := .I], id.var="rn", variable.name="V", value.name="ID")[,
    tid := NA_integer_]

#the recursive function
link <- function(ids, label) {
    #identify the rows in DT containing ids and extract the IDs
    newids <- mDT[mDT[.(ID=ids), on=.(ID), .(rn=rn)], on=.(rn), allow.cartesian=TRUE,
        unique(ID)]

    #update those rows to the same group
    mDT[mDT[.(ID=ids), on=.(ID), .(rn=rn)], on=.(rn), tid := label]

    if (length(setdiff(newids, ids)) > 0L) {
        #call the recursive function if there are new ids
        link(newids, label)
    }
}

#get the first id that is not labelled yet
id <- mDT[is.na(tid), ID[1L]]
grp <- 1L
while(!is.na(id)) {
    #use recursive function to link them up
    link(id, grp)

    #repeat for next id that is not part of any group yet
    id <- mDT[is.na(tid), ID[1L]]
    grp <- grp + 1L
}

#update original DT with tid
DT[mDT, on=.(rn), tid := tid]

data:

library(data.table)
DT <- data.table(ida = c("A", "B", "C", "C", "D", "E"),
    idb = c("1", "2", "3", "4", "4", "7"),
    idc = c("1", "1", "2", "3", "4", "5"),
    idd = c("1", "A", "2", "3", "4", "5"))



回答2:


Using igraph:

# example input, I removed "trueid" column
df <- data.frame(ida = c("A", "B", "C", "C", "D", "E"),
                 idb = c("1", "2", "3", "4", "4", "7"),
                 idc = c("1", "1", "2", "3", "4", "5"),
                 idd = c("1", "A", "2", "3", "4", "5"))
#trueid = c("id1", "id1", "id1", "id1", "id1", "id2")

library(igraph)

# set up connections
# Improved version suggested by @thelatemail in the comments
x <- cbind(df[ 1 ], unlist(df[ -1 ]))

# original clumsy version (do not use)
# x <- unique(do.call(rbind, lapply(1:(ncol(df) - 1), function(i) setNames(df[, c(i, i + 1) ], c("from", "to")))))

# convert to graph object
g <- graph_from_data_frame(x)        

# plot if you wish to visualise
plot(g)

# this is the solution, add membership ids to original input dataframe
merge(df, data.frame(grp = clusters(g)$membership),
      by.x = "ida", by.y = 0)
#   ida idb idc idd grp
# 1   A   1   1   1   1
# 2   B   2   1   A   1
# 3   C   3   2   2   1
# 4   C   4   3   3   1
# 5   D   4   4   4   1
# 6   E   7   5   5   2


来源:https://stackoverflow.com/questions/56548395/find-unique-set-of-identifiers-groups-among-several-columns

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!