How to calculate a table of pairwise counts from a long-form data frame

闹比i 2020-12-06 20:20

I have a 'long-form' data frame with columns id (the primary key) and featureCode (a categorical variable). Each record has between 1 and 9 values

4 answers
  •  一个人的身影
    2020-12-06 21:13

    If you don't need that exact structure, but just need to get the pairwise counts, you can try this approach:

    Here's your data:

    dat <- read.table(header = TRUE, 
           text = "id  featureCode
                    5         PPLC
                    5         PCLI
                    6         PPLC
                    6         PCLI
                    7          PPL
                    7         PPLC
                    7         PCLI
                    8         PPLC
                    9         PPLC
                   10         PPLC")
    

    We're only interested in ids where there is more than one featureCode:

    dat2 <- dat[ave(dat$id, dat$id, FUN=length) > 1, ]
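
    To see why this filter works, here is a minimal sketch on a smaller, made-up data frame (the names toy, n_per_row and toy2 are only for illustration). ave() with FUN = length returns, for every row, the number of rows that share its id, so comparing against 1 drops the ids that have a single featureCode.

    ```r
    # Hypothetical toy data: ids 5 and 7 have several codes, id 8 has only one.
    toy <- data.frame(id = c(5, 5, 7, 7, 7, 8),
                      featureCode = c("PPLC", "PCLI", "PPL", "PPLC", "PCLI", "PPLC"))

    # One value per row: the size of that row's id group.
    n_per_row <- ave(toy$id, toy$id, FUN = length)
    n_per_row
    # [1] 2 2 3 3 3 1

    # Keep only the rows whose id occurs more than once (drops id 8).
    toy2 <- toy[n_per_row > 1, ]
    ```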
    

    Having this data as a list is going to be useful since it will let us use lapply to get the pairwise combinations.

    dat2 <- split(dat2$featureCode, dat2$id)
    

    This next step can be broken down into its intermediate sections if you prefer, but the basic idea is to create combinations of the vectors in each list item and then tabulate the unlisted output.

    table(unlist(lapply(dat2, function(x) 
      combn(sort(x), 2, FUN = function(y) 
        paste(y, collapse = "+")))))
    # 
    #  PCLI+PPL PCLI+PPLC  PPL+PPLC 
    #         1         3         1
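
    If the one-liner is hard to read, the intermediate steps might look roughly like this. It is only a sketch on a hand-built list (by_id, pairs_7, labels_7 and the other names are made up for illustration). The sort() matters: without it the same pair could show up as both "PPLC+PCLI" and "PCLI+PPLC" and be counted separately.

    ```r
    # Hypothetical stand-in for the split list (names are the ids).
    by_id <- list(`5` = c("PPLC", "PCLI"),
                  `7` = c("PPL", "PPLC", "PCLI"))

    # Step 1: all pairs within one id, as a 2-row matrix (one column per pair).
    pairs_7 <- combn(sort(by_id[["7"]]), 2)

    # Step 2: collapse each column into a single label such as "PCLI+PPL".
    labels_7 <- apply(pairs_7, 2, paste, collapse = "+")

    # Step 3: do both steps for every id, flatten the result, and count the labels.
    all_labels <- unlist(lapply(by_id, function(x)
      combn(sort(x), 2, FUN = paste, collapse = "+")))
    pair_counts <- table(all_labels)
    ```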
    

    Update: A better approach, adapted from an answer to another question

    With a little bit of modification, @flodel's answer to another question is applicable here. It requires the igraph package to be installed (install.packages("igraph")).

    dat2 <- dat[ave(dat$id, dat$id, FUN=length) > 1, ]
    dat2 <- split(dat2$featureCode, dat2$id)
    library(igraph)
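    # Build an undirected graph with one edge per co-occurring pair of codes.
    # graph_from_edgelist() and as_adjacency_matrix() are the current igraph
    # names for graph.edgelist() and get.adjacency().
    g <- graph_from_edgelist(matrix(unlist(lapply(dat2, function(x) 
      combn(as.character(x), 2, simplify = FALSE))), ncol = 2, byrow=TRUE), 
                        directed=FALSE)
    # Each cell counts the parallel edges between two codes, i.e. the pair count.
    as_adjacency_matrix(g)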
    # 3 x 3 sparse Matrix of class "dgCMatrix"
    #      PPLC PCLI PPL
    # PPLC    .    3   1
    # PCLI    3    .   1
    # PPL     1    1   .
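
    If installing igraph is not an option, a similar symmetric matrix can probably be obtained in base R with crossprod() on an id-by-featureCode table. This is a different technique from the answer referenced above, sketched here on made-up data (toy, incidence and co_occur are illustrative names), and it assumes each id/featureCode combination occurs at most once.

    ```r
    # Hypothetical toy data in the same long form as the question.
    toy <- data.frame(id = c(5, 5, 6, 6, 7, 7, 7, 8),
                      featureCode = c("PPLC", "PCLI", "PPLC", "PCLI",
                                      "PPL", "PPLC", "PCLI", "PPLC"))

    # Rows are ids, columns are codes; cells are 0/1 under the uniqueness assumption.
    incidence <- table(toy$id, toy$featureCode)

    # t(incidence) %*% incidence: cell [a, b] counts the ids that have both a and b.
    co_occur <- crossprod(incidence)

    # The diagonal holds per-code totals rather than pair counts, so blank it out.
    diag(co_occur) <- 0
    co_occur
    ```

    Ids with a single featureCode contribute only to the diagonal, so they do not need to be filtered out first.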
    
