I have a dataframe that looks like this (this is just a subset; the actual dataset has 2,724,098 rows):
> head(dat)
chr start end enhancer motif
chr10
Updated: Here is a fast and memory-efficient version using data.table:
Step 1: Construct sample data approximately matching your dimensions:
require(data.table) ## 1.9.4+
set.seed(1L) ## For reproducibility
N = 2724098L
motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE) ## 1716 distinct motif labels
id = sample(83509, N, TRUE) ## up to 83509 distinct ids
DT = data.table(id, motif) ## one row per (id, motif) occurrence
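If you are starting from your actual data frame, a two-column table of the same shape could presumably be built along these lines. This is only a sketch: I am assuming the enhancer column is the id you want to pair motifs within; swap in whichever column that really is.

## hypothetical mapping from the original columns; using 'enhancer' as the id is an assumption
DT = as.data.table(dat)[, .(id = enhancer, motif)]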
Step 2: Pre-processing:
DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
setorder(DT) ## IMPORTANT: motifs are ordered within id as well
setkey(DT, id) ## reset key to 'id'. Motifs ordered within id from previous step
DT[, runlen := .I] ## row number; used in Step 3 to keep each motif pair in one orientation only
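To see what the pre-processing accomplishes, here is a tiny made-up example (not from the original data): after these steps each id holds a sorted, duplicate-free set of motifs, and runlen is simply the row number.

toy = data.table(id    = c(1L, 1L, 1L, 2L, 2L),
                 motif = c("motif_b", "motif_a", "motif_a", "motif_c", "motif_a"))
toy = unique(toy)        ## drops the duplicated (1, motif_a) row
setorder(toy)            ## rows now sorted by id, then motif
setkey(toy, id)          ## keyed by id; within-id motif order from the previous step is kept
toy[, runlen := .I]      ## runlen = 1, 2, 3, 4 down the rows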
Step 3: Solution:
ans = DT[DT, {
           tmp = runlen < i.runlen
           list(motif[tmp], i.motif[any(tmp)])
         },
         by=.EACHI][, .N, by="V1,V2"]
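The result comes back with default column names V1, V2 and the count N. If you prefer readable names, something along these lines works (the names here are just my choice, not part of the answer above):

setnames(ans, c("V1", "V2"), c("motif_a", "motif_b"))
head(ans[order(-N)])   ## most frequently co-occurring motif pairs first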
This takes ~27 seconds and uses ~1GB of memory during the final step (Step 3).
The idea is to perform a self-join, but to use data.table's by=.EACHI feature, which evaluates the j-expression for each row of i and is therefore memory-efficient. The j-expression also ensures that we only obtain the entry "motif_a, motif_b" and not the redundant "motif_b, motif_a", which saves computation time and memory as well. The binary-search-based join is quite fast, even though there are ~83.5K ids. Finally, we aggregate by the motif combinations to count the number of rows in each of them - which is what you require.
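As a small sanity check on toy data of my own (not part of the original answer), the same join reproduces the counts you would get by listing motif pairs per id by hand: (A, B) appears under ids 1, 2 and 3, while (A, C) and (B, C) appear only under id 1.

toy = data.table(id    = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
                 motif = c("A", "B", "C", "A", "B", "A", "B"))
setkey(toy, id)       ## already unique and ordered, so only keying is needed here
toy[, runlen := .I]
toy[toy, {
      tmp = runlen < i.runlen
      list(motif[tmp], i.motif[any(tmp)])
    },
    by=.EACHI][, .N, by="V1,V2"]
## expected counts: (A, B) -> 3, (A, C) -> 1, (B, C) -> 1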
HTH
PS: See the revision history for the older (and slower) version.