How to calculate a table of pairwise counts from a long-form data frame

闹比i 2020-12-06 20:20

I have a 'long-form' data frame with columns id (the primary key) and featureCode (a categorical variable). Each record has between 1 and 9 values

4 Answers
  • 2020-12-06 21:13

    If you don't need that exact structure, but just need to get the pairwise counts, you can try this approach:

    Here's your data:

    dat <- read.table(header = TRUE, 
           text = "id  featureCode
                    5         PPLC
                    5         PCLI
                    6         PPLC
                    6         PCLI
                    7          PPL
                    7         PPLC
                    7         PCLI
                    8         PPLC
                    9         PPLC
                   10         PPLC")
    

    We're only interested in ids where there is more than one featureCode:

    dat2 <- dat[ave(dat$id, dat$id, FUN=length) > 1, ]
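
    To see why this filter works: ave() with FUN = length returns, for every row, the number of rows that share that row's id, so the comparison keeps only ids that occur more than once. A minimal sketch on a made-up vector (ids, group_size and keep are illustrative names, not part of the original data):

    ```r
    # Hypothetical id column: 5 appears twice, 7 three times, 8 once
    ids <- c(5, 5, 7, 7, 7, 8)

    # For each element, the size of the group it belongs to
    group_size <- ave(ids, ids, FUN = length)
    group_size
    # 2 2 3 3 3 1

    # Logical mask selecting elements whose id occurs more than once
    keep <- group_size > 1
    ids[keep]
    # 5 5 7 7 7
    ```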
    

    Having this data as a list is going to be useful since it will let us use lapply to get the pairwise combinations.

    dat2 <- split(dat2$featureCode, dat2$id)
    

    This next step can be broken down into its intermediate sections if you prefer, but the basic idea is to create combinations of the vectors in each list item and then tabulate the unlisted output.

    table(unlist(lapply(dat2, function(x) 
      combn(sort(x), 2, FUN = function(y) 
        paste(y, collapse = "+")))))
    # 
    #  PCLI+PPL PCLI+PPLC  PPL+PPLC 
    #         1         3         1
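
    For readers who prefer the intermediate steps spelled out, here is a sketch of the same computation in separate stages. The names by_id, pairs_per_id, pair_labels and pair_counts are illustrative; by_id is typed in by hand to mirror the dat2 list built above.

    ```r
    # Same shape as dat2 after split(): one character vector per id
    by_id <- list(`5` = c("PPLC", "PCLI"),
                  `6` = c("PPLC", "PCLI"),
                  `7` = c("PPL", "PPLC", "PCLI"))

    # Step 1: sort each vector so a given pair always appears in the same order,
    # then enumerate all 2-element combinations (one matrix column per pair)
    pairs_per_id <- lapply(by_id, function(x) combn(sort(x), 2))

    # Step 2: collapse each column into a single "A+B" label
    pair_labels <- lapply(pairs_per_id,
                          function(m) apply(m, 2, paste, collapse = "+"))

    # Step 3: flatten the list and count how often each label occurs
    pair_counts <- table(unlist(pair_labels))
    pair_counts
    ```

    Sorting matters here: without it, the same two codes listed in a different order under another id would be counted as a separate pair.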
    

    Update: A better answer at another question

    With a little bit of modification, @flodel's answer to another question is applicable here. It requires the igraph package to be installed (install.packages("igraph")).

    dat2 <- dat[ave(dat$id, dat$id, FUN=length) > 1, ]
    dat2 <- split(dat2$featureCode, dat2$id)
    library(igraph)
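    # graph_from_edgelist() and as_adjacency_matrix() are the current igraph
    # names for graph.edgelist() and get.adjacency()
    g <- graph_from_edgelist(matrix(unlist(lapply(dat2, function(x) 
      combn(as.character(x), 2, simplify = FALSE))), ncol = 2, byrow = TRUE), 
                             directed = FALSE)
    as_adjacency_matrix(g)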
    # 3 x 3 sparse Matrix of class "dgCMatrix"
    #      PPLC PCLI PPL
    # PPLC    .    3   1
    # PCLI    3    .   1
    # PPL     1    1   .
    
  • 2020-12-06 21:14

    Here is another solution that I think is conceptually easy to follow. You have a bipartite graph here and simply need its projection onto the "featureCode" vertices. Here is how to do this with the igraph package:

    dat <- read.table(header = TRUE, stringsAsFactors=FALSE,
                      text = "id  featureCode                                       
                              5         PPLC                                                  
                              5         PCLI                                                  
                              6         PPLC                                                  
                              6         PCLI                                                  
                              7          PPL                                                  
                              7         PPLC                                                  
                              7         PCLI                                                  
                              8         PPLC                                                  
                              9         PPLC                                                  
                             10         PPLC")
    
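    library(igraph)

    # Mark id vertices as type TRUE and featureCode vertices as type FALSE
    g <- graph_from_data_frame(dat, vertices=unique(data.frame(c(dat[,1], dat[,2]),
                              type=rep(c(TRUE, FALSE), each=nrow(dat)))))
    
    # Project onto the featureCode vertices; edge weights are the pairwise counts
    as_adjacency_matrix(bipartite_projection(g)[[1]], attr="weight", sparse=FALSE)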
    
    #      PPLC PCLI PPL
    # PPLC    0    3   1
    # PCLI    3    0   1
    # PPL     1    1   0
    
  • 2020-12-06 21:18

    Here is a data.table approach similar to @mrdwab's.

    It works best if featureCode is a character vector.

    library(data.table)
    
    DT <- data.table(dat)
    # convert to character
    DT[, featureCode := as.character(featureCode)]
    # subset those with >1 per id
    DT2 <- DT[, N := .N, by = id][N>1]
    # create all combinations of 2
    # return as a data.table with these as columns `V1` and `V2`
    # then count the numbers in each group
    DT2[, rbindlist(combn(featureCode, 2, 
          FUN = function(x) as.data.table(as.list(x)), simplify = FALSE)), 
        by = id][, .N, by = list(V1, V2)]
    
    #      V1   V2 N
    # 1: PPLC PCLI 3
    # 2:  PPL PPLC 1
    # 3:  PPL PCLI 1
    
  • 2020-12-06 21:20

    I would use SQL, which is available in R through the sqldf package.

    Extract all possible combinations with something like:

    library(sqldf)

    # Assumes the data frame is named df
    sqldf("select distinct df1.featureCode, df2.featureCode
           from df df1, df df2       
           ")
    

    Then you can extract the result elements:
    (Maybe just use a for loop for all combinations)
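
    As an alternative to looping over one query per pair, the self-join can be done once for all pairs. The following is a minimal base R sketch of that idea, using merge() in place of SQL; it is not the sqldf code from this answer. The names joined and pair_counts are illustrative, and df is assumed to hold the data from the question.

    ```r
    # Assumption: df holds the question's data
    df <- data.frame(
      id = c(5, 5, 6, 6, 7, 7, 7, 8, 9, 10),
      featureCode = c("PPLC", "PCLI", "PPLC", "PCLI", "PPL",
                      "PPLC", "PCLI", "PPLC", "PPLC", "PPLC"),
      stringsAsFactors = FALSE
    )

    # Self-join on id: every featureCode paired with every featureCode of the same id
    joined <- merge(df, df, by = "id", suffixes = c("1", "2"))

    # Keep each unordered pair once and drop self-pairs
    joined <- joined[joined$featureCode1 < joined$featureCode2, ]

    # Count how many ids share each pair
    pair_counts <- aggregate(id ~ featureCode1 + featureCode2,
                             data = joined, FUN = length)
    names(pair_counts)[3] <- "n"
    pair_counts
    ```

    The individual per-pair queries look like this: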

    PCLI - PPLC

    sqldf("select count(df1.id)
           from df df1, df df2
           where df1.id = df2.id
           and df1.featureCode = 'PCLI' and df2.featureCode = 'PPLC'
           ")
    

    PPLC - PPL

    sqldf("select count(df1.id)
           from df df1, df df2
           where df1.id = df2.id
           and df1.featureCode = 'PPLC' and df2.featureCode = 'PPL'
           ")
    

    PCLI - PPL

    sqldf("select count(df1.id)
           from df df1, df df2
           where df1.id = df2.id
           and df1.featureCode = 'PCLI' and df2.featureCode = 'PPL'
           ")
    

    There is surely an easier solution out there, especially if you have more combinations to consider. Searching for "contingency table" may help.
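
    One contingency-table route, sketched in base R and not taken from the answers above: tabulate id against featureCode, then multiply that table by its own transpose with crossprod(). Assuming each id lists a given featureCode at most once, the off-diagonal cells are the pairwise counts and the diagonal holds the number of ids per featureCode. The names incidence and co_occurrence are illustrative.

    ```r
    # The question's data
    dat <- data.frame(
      id = c(5, 5, 6, 6, 7, 7, 7, 8, 9, 10),
      featureCode = c("PPLC", "PCLI", "PPLC", "PCLI", "PPL",
                      "PPLC", "PCLI", "PPLC", "PPLC", "PPLC"),
      stringsAsFactors = FALSE
    )

    # id x featureCode table: 1 where the id has that code, 0 otherwise
    incidence <- table(dat$id, dat$featureCode)

    # featureCode x featureCode co-occurrence counts
    co_occurrence <- crossprod(incidence)

    # Zero the diagonal if only the pairwise counts are wanted
    diag(co_occurrence) <- 0
    co_occurrence
    ```

    This scales to any number of featureCode values without writing one query per pair.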
