I have a 'long-form' data frame with columns id (the primary key) and featureCode (a categorical variable). Each record has between 1 and 9 values.
If you don't need that exact structure, but just need to get the pairwise counts, you can try this approach:
Here's your data:
dat <- read.table(header = TRUE,
text = "id featureCode
5 PPLC
5 PCLI
6 PPLC
6 PCLI
7 PPL
7 PPLC
7 PCLI
8 PPLC
9 PPLC
10 PPLC")
We're only interested in ids where there is more than one featureCode:
dat2 <- dat[ave(dat$id, dat$id, FUN=length) > 1, ]
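To make the filter less opaque, here is a small sketch of what ave() contributes. It repeats the sample data so it runs on its own; the names group_size and keep are only illustrative and are not part of the answer's code.

```r
dat <- read.table(header = TRUE, text = "id featureCode
5 PPLC
5 PCLI
6 PPLC
6 PCLI
7 PPL
7 PPLC
7 PCLI
8 PPLC
9 PPLC
10 PPLC")

# ave() returns one value per row: here, the size of that row's id group
group_size <- ave(dat$id, dat$id, FUN = length)
group_size

# only rows whose id occurs more than once survive the filter
keep <- group_size > 1
dat[keep, ]
```

The ids 8, 9 and 10 each occur once, so they cannot form a pair and are dropped.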
Having this data as a list is going to be useful since it will let us use lapply to get the pairwise combinations.
dat2 <- split(dat2$featureCode, dat2$id)
This next step can be broken down into its intermediate sections if you prefer, but the basic idea is to create combinations of the vectors in each list item and then tabulate the unlisted output.
table(unlist(lapply(dat2, function(x)
combn(sort(x), 2, FUN = function(y)
paste(y, collapse = "+")))))
#
# PCLI+PPL PCLI+PPLC PPL+PPLC
# 1 3 1
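For readers who prefer the intermediate steps spelled out, the one-liner can be unrolled roughly as follows. The list is written by hand here so the snippet is self-contained (it should match what split() produces for the sample data), and the names pairs_per_id, labels_per_id and pair_counts are illustrative only.

```r
# one character vector of featureCodes per id (hand-written stand-in for dat2)
dat2 <- list(`5` = c("PPLC", "PCLI"),
             `6` = c("PPLC", "PCLI"),
             `7` = c("PPL", "PPLC", "PCLI"))

# Step 1: all 2-element combinations within each id; sorting first means
# PCLI/PPLC and PPLC/PCLI end up under the same label
pairs_per_id <- lapply(dat2, function(x) combn(sort(x), 2, simplify = FALSE))

# Step 2: collapse each pair into a single "A+B" label
labels_per_id <- lapply(pairs_per_id,
                        function(p) sapply(p, paste, collapse = "+"))

# Step 3: flatten the list and tabulate the labels
pair_counts <- table(unlist(labels_per_id))
pair_counts
```

Sorting inside each id is what makes the counts order-independent; without it the same pair could be tallied under two different labels.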
With a little bit of modification, @flodel's answer to another question is applicable here. It requires the igraph package to be installed (install.packages("igraph")).
dat2 <- dat[ave(dat$id, dat$id, FUN=length) > 1, ]
dat2 <- split(dat2$featureCode, dat2$id)
library(igraph)
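# one row per within-id pair of featureCodes
edges <- matrix(unlist(lapply(dat2, function(x)
  combn(as.character(x), 2, simplify = FALSE))), ncol = 2, byrow = TRUE)
# graph_from_edgelist() and as_adjacency_matrix() are the current names
# of the deprecated graph.edgelist() and get.adjacency()
g <- graph_from_edgelist(edges, directed = FALSE)
as_adjacency_matrix(g)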
# 3 x 3 sparse Matrix of class "dgCMatrix"
# PPLC PCLI PPL
# PPLC . 3 1
# PCLI 3 . 1
# PPL 1 1 .
Here is another solution, which I think is conceptually easy to follow. You have a bipartite graph here, with ids on one side and featureCodes on the other, and you simply need the projection of this graph onto the "featureCode" vertices. Here is how to do this with the igraph package:
dat <- read.table(header = TRUE, stringsAsFactors=FALSE,
text = "id featureCode
5 PPLC
5 PCLI
6 PPLC
6 PCLI
7 PPL
7 PPLC
7 PCLI
8 PPLC
9 PPLC
10 PPLC")
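library(igraph)
# ids get type TRUE, featureCodes get type FALSE
g <- graph_from_data_frame(dat, vertices=unique(data.frame(c(dat[,1], dat[,2]),
  type=rep(c(TRUE, FALSE), each=nrow(dat)))))
# the first projection is onto the FALSE vertices, i.e. the featureCodes
as_adjacency_matrix(bipartite_projection(g)[[1]], attr="weight", sparse=FALSE)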
# PPLC PCLI PPL
# PPLC 0 3 1
# PCLI 3 0 1
# PPL 1 1 0
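The same projection can be sketched in base R without igraph, by treating the id-by-featureCode table as the incidence matrix of the bipartite graph and multiplying it by its own transpose. This is a swapped-in technique rather than part of the answer above, and it assumes each (id, featureCode) pair appears at most once; the names inc and co are illustrative.

```r
dat <- read.table(header = TRUE, stringsAsFactors = FALSE,
                  text = "id featureCode
5 PPLC
5 PCLI
6 PPLC
6 PCLI
7 PPL
7 PPLC
7 PCLI
8 PPLC
9 PPLC
10 PPLC")

# incidence table: rows are ids, columns are featureCodes, entries are 0/1
inc <- table(dat$id, dat$featureCode)

# t(inc) %*% inc: entry [i, j] counts the ids that carry both featureCodes
co <- crossprod(inc)

# the diagonal holds each featureCode's own frequency; zero it out
diag(co) <- 0
co
```

Rows and columns come out in alphabetical order here, so the layout differs from the igraph output even though the counts are the same.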
Here is a data.table approach similar to @mrdwab's. It will work best if featureCode is a character vector.
library(data.table)
DT <- data.table(dat)
# convert to character
DT[, featureCode := as.character(featureCode)]
# subset those with >1 per id
DT2 <- DT[, N := .N, by = id][N>1]
# create all combinations of 2
# return as a data.table with these as columns `V1` and `V2`
# then count the numbers in each group
DT2[, rbindlist(combn(featureCode, 2,
  FUN = function(x) as.data.table(as.list(x)), simplify = FALSE)),
  by = id][, .N, by = list(V1, V2)]
#      V1   V2 N
# 1: PPLC PCLI 3
# 2:  PPL PPLC 1
# 3:  PPL PCLI 1
I would use SQL, which is available in R through the sqldf package.
Extract all possible combinations with something like:
sqldf("select distinct df1.featureCode, df2.featureCode
from df df1, df df2
")
Then you can count the ids for each combination
(maybe just use a for loop over all combinations):
PCLI - PPLC
sqldf("select count(df1.id)
from df df1, df df2
where df1.id = df2.id
and df1.featureCode = 'PCLI' and df2.featureCode = 'PPLC'
")
PPLC - PPL
sqldf("select count(df1.id)
from df df1, df df2
where df1.id = df2.id
and df1.featureCode = 'PPLC' and df2.featureCode = 'PPL'
")
PCLI - PPL
sqldf("select count(df1.id)
from df df1, df df2
where df1.id = df2.id
and df1.featureCode = 'PCLI' and df2.featureCode = 'PPL'
")
There is surely an easier solution out there, especially if you have more combinations to consider. Searching for "contingency table" may help.
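One candidate for such an easier solution, staying with sqldf, is to fold the per-pair queries into a single self-join with a GROUP BY. This is only a sketch: it assumes the data frame is called df as in the queries above, and the column aliases f1, f2 and n are made up for illustration.

```r
library(sqldf)

df <- read.table(header = TRUE, stringsAsFactors = FALSE,
                 text = "id featureCode
5 PPLC
5 PCLI
6 PPLC
6 PCLI
7 PPL
7 PPLC
7 PCLI
8 PPLC
9 PPLC
10 PPLC")

# join df to itself on id; the '<' keeps each unordered pair exactly once
# and drops self-pairs, then GROUP BY counts the ids per pair
res <- sqldf("select a.featureCode as f1, b.featureCode as f2, count(*) as n
              from df a join df b
                on a.id = b.id and a.featureCode < b.featureCode
              group by a.featureCode, b.featureCode
              order by f1, f2")
res
```

This avoids writing one query per combination, so it keeps working unchanged when more featureCode values appear.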