How to calculate a table of pairwise counts from a long-form data frame


I have a 'long-form' data frame with columns id (the primary key) and featureCode (a categorical variable). Each record has between 1 and 9 values


4 answers

一个人的身影

2020-12-06 21:13

If you don't need that exact structure, but just need to get the pairwise counts, you can try this approach:

Here's your data:

dat <- read.table(header = TRUE, 
       text = "id  featureCode
                5         PPLC
                5         PCLI
                6         PPLC
                6         PCLI
                7          PPL
                7         PPLC
                7         PCLI
                8         PPLC
                9         PPLC
               10         PPLC")

We're only interested in ids where there is more than one featureCode:

dat2 <- dat[ave(dat$id, dat$id, FUN=length) > 1, ]
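To see why that filter works, here is a minimal sketch on a toy vector of ids (the names group_size and keep are only for illustration). ave() returns, for every row, the size of the group that row belongs to, so comparing it with 1 gives a logical row index:

```r
# Toy ids mirroring part of the example: 5, 6 and 7 repeat, 8 appears once
id <- c(5, 5, 6, 6, 7, 7, 7, 8)

# For each element, the number of elements sharing its id
group_size <- ave(id, id, FUN = length)
print(group_size)

# TRUE for rows whose id occurs more than once
keep <- group_size > 1
print(id[keep])
```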

Having this data as a list is going to be useful since it will let us use lapply to get the pairwise combinations.

dat2 <- split(dat2$featureCode, dat2$id)

This next step can be broken down into intermediate steps if you prefer, but the basic idea is to create all pairwise combinations of the vector in each list item and then tabulate the unlisted output.

table(unlist(lapply(dat2, function(x) 
  combn(sort(x), 2, FUN = function(y) 
    paste(y, collapse = "+")))))
# 
#  PCLI+PPL PCLI+PPLC  PPL+PPLC 
#         1         3         1
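If the nested one-liner is hard to read, the same computation can be sketched as separate steps. The intermediate names (multi, by_id, pairs_by_id, all_pairs, pair_counts) are made up for this illustration, and the data frame is rebuilt inline so the snippet runs on its own:

```r
dat <- data.frame(
  id = c(5, 5, 6, 6, 7, 7, 7, 8, 9, 10),
  featureCode = c("PPLC", "PCLI", "PPLC", "PCLI", "PPL",
                  "PPLC", "PCLI", "PPLC", "PPLC", "PPLC"),
  stringsAsFactors = FALSE
)

# Keep ids with more than one featureCode, then split into a list by id
multi <- dat[ave(dat$id, dat$id, FUN = length) > 1, ]
by_id <- split(multi$featureCode, multi$id)

# Step 1: for each id, build every sorted pair as an "A+B" label
pairs_by_id <- lapply(by_id, function(x) {
  combn(sort(x), 2, FUN = paste, collapse = "+")
})

# Step 2: flatten the list into one character vector of labels
all_pairs <- unlist(pairs_by_id, use.names = FALSE)

# Step 3: count how often each label occurs
pair_counts <- table(all_pairs)
print(pair_counts)
```

Sorting before combn() is what makes PPLC+PCLI and PCLI+PPLC land in the same bucket.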

Update: A better answer at another question

With a little bit of modification, @flodel's answer to another question is applicable here. It requires the igraph package to be installed (install.packages("igraph")).

dat2 <- dat[ave(dat$id, dat$id, FUN=length) > 1, ]
dat2 <- split(dat2$featureCode, dat2$id)
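library(igraph)
# graph_from_edgelist() and as_adjacency_matrix() are the current names for
# graph.edgelist() and get.adjacency()
g <- graph_from_edgelist(matrix(unlist(lapply(dat2, function(x) 
  combn(as.character(x), 2, simplify = FALSE))), ncol = 2, byrow = TRUE), 
                    directed = FALSE)
as_adjacency_matrix(g)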
# 3 x 3 sparse Matrix of class "dgCMatrix"
#      PPLC PCLI PPL
# PPLC    .    3   1
# PCLI    3    .   1
# PPL     1    1   .


被撕碎了的回忆

2020-12-06 21:14

Another solution, which is conceptually easy to follow, I think. You have a bipartite graph here, and simply need the projection of this graph onto the "featureCode" vertices. Here is how to do this with the igraph package:

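library(igraph)

dat <- read.table(header = TRUE, stringsAsFactors = FALSE,
                  text = "id  featureCode
                          5         PPLC
                          5         PCLI
                          6         PPLC
                          6         PCLI
                          7          PPL
                          7         PPLC
                          7         PCLI
                          8         PPLC
                          9         PPLC
                         10         PPLC")

# ids get type = TRUE, featureCodes get type = FALSE
# (graph_from_data_frame() and bipartite_projection() were formerly
# graph.data.frame() and bipartite.projection())
g <- graph_from_data_frame(dat, vertices = unique(data.frame(c(dat[, 1], dat[, 2]),
                          type = rep(c(TRUE, FALSE), each = nrow(dat)))))

# The first projection is the one onto the type = FALSE vertices (the featureCodes)
as_adjacency_matrix(bipartite_projection(g)[[1]], attr = "weight", sparse = FALSE)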

#      PPLC PCLI PPL
# PPLC    0    3   1
# PCLI    3    0   1
# PPL     1    1   0
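As a cross-check that needs no extra package, the same projection can be sketched in base R as a matrix product. This is a different technique from the igraph call, and it assumes each (id, featureCode) combination appears at most once, so that the incidence matrix holds only 0s and 1s. The names incidence and co are made up for this example:

```r
dat <- data.frame(
  id = c(5, 5, 6, 6, 7, 7, 7, 8, 9, 10),
  featureCode = c("PPLC", "PCLI", "PPLC", "PCLI", "PPL",
                  "PPLC", "PCLI", "PPLC", "PPLC", "PPLC"),
  stringsAsFactors = FALSE
)

# ids in rows, featureCodes in columns; 1 where the id has that code
incidence <- unclass(table(dat$id, dat$featureCode))

# t(incidence) %*% incidence: entry [a, b] counts ids having both a and b
co <- crossprod(incidence)

# The diagonal holds the per-code totals, not pair counts, so blank it out
diag(co) <- 0
print(co)
```

The rows and columns come out in alphabetical order (PCLI, PPL, PPLC) rather than in order of first appearance, but the counts should agree with the adjacency matrix above.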


执念已碎

2020-12-06 21:18

Here is a data.table approach similar to @mrdwab's.

It works best if featureCode is a character column.

library(data.table)

DT <- data.table(dat)
# convert to character
DT[, featureCode := as.character(featureCode)]
# subset those with >1 per id
DT2 <- DT[, N := .N, by = id][N>1]
# create all combinations of 2
# return as a data.table with these as columns `V1` and `V2`
# then count the numbers in each group
DT2[, rbindlist(combn(featureCode, 2, 
      FUN = function(x) as.data.table(as.list(x)), simplify = FALSE)), 
    by = id][, .N, by = list(V1, V2)]
#      V1   V2 N
# 1: PPLC PCLI 3
# 2:  PPL PPLC 1
# 3:  PPL PCLI 1
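One thing to watch for: the pairs here follow the row order within each id, so the same two codes listed in a different order for another id would be counted as a separate group. A small sketch of one way around that, sorting inside combn(), using a hypothetical toy table (toy and pair_counts are names invented for this example):

```r
library(data.table)

# Hypothetical toy data: the same pair, stored in opposite order for ids 1 and 2
toy <- data.table(id = c(1, 1, 2, 2),
                  featureCode = c("A", "B", "B", "A"))

# Sort within each id so every pair comes out as (smaller, larger);
# t(combn(...)) gives one row per pair, which becomes columns V1 and V2
pair_counts <- toy[, as.data.table(t(combn(sort(featureCode), 2))),
                   by = id][, .N, by = .(V1, V2)]
print(pair_counts)
```

Like the code above, this assumes every id left in the table has at least two rows.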


臣服心动

2020-12-06 21:20

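I would use SQL, which is available in R through the sqldf package.

Extract all possible combinations with something like:

library(sqldf)
df <- dat  # the queries below refer to the data frame as df
sqldf("select distinct df1.featureCode as code1, df2.featureCode as code2
       from df df1, df df2
       where df1.featureCode < df2.featureCode")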

Then you can count each pair separately (a for loop over all combinations would also work):

PCLI - PPLC

sqldf("select count(df1.id)
       from df df1, df df2
       where df1.id = df2.id
       and df1.featureCode = 'PCLI' and df2.featureCode = 'PPLC'
       ")

PPLC - PPL

sqldf("select count(df1.id)
       from df df1, df df2
       where df1.id = df2.id
       and df1.featureCode = 'PPLC' and df2.featureCode = 'PPL'
       ")

PCLI - PPL

sqldf("select count(df1.id)
       from df df1, df df2
       where df1.id = df2.id
       and df1.featureCode = 'PCLI' and df2.featureCode = 'PPL'
       ")

There is surely an easier solution out there, especially if you have more combinations to consider. Searching for "contingency table" may help.
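One candidate for such a simpler route is a single grouped self-join, which counts every pair in one query instead of one query per pair. This is only a sketch: the aliases a, b, code1, code2 and n are invented here, the data frame is rebuilt inline under the name dat, and the string comparison keeps each unordered pair once and drops self-pairs:

```r
library(sqldf)

dat <- data.frame(
  id = c(5, 5, 6, 6, 7, 7, 7, 8, 9, 10),
  featureCode = c("PPLC", "PCLI", "PPLC", "PCLI", "PPL",
                  "PPLC", "PCLI", "PPLC", "PPLC", "PPLC"),
  stringsAsFactors = FALSE
)

# Join the table to itself on id, keep pairs with code1 < code2,
# and count the matching ids per pair
pair_counts <- sqldf("
  select a.featureCode as code1, b.featureCode as code2, count(*) as n
  from dat a
  join dat b on a.id = b.id and a.featureCode < b.featureCode
  group by a.featureCode, b.featureCode
")
print(pair_counts)
```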
