Count every possible pair of values in a column grouped by multiple columns

不知归路 2020-12-03 15:59

I have a data frame that looks like this (this is just a subset; the actual dataset has 2724098 rows):

> head(dat)

chr   start  end    enhancer motif 
chr10          


        
7 answers
  •  清歌不尽
    2020-12-03 16:41

    Updated: Here is a fast and memory-efficient version using data.table:

    • Step 1: Construct sample data of approximately your dimensions:

      require(data.table) ## 1.9.4+
      set.seed(1L)        ## For reproducibility
      N = 2724098L
      motif = sample(paste("motif", 1:1716, sep="_"), N, TRUE)
      id = sample(83509, N, TRUE)
      DT = data.table(id, motif)
      
    • Step 2: Pre-processing:

      DT = unique(DT) ## IMPORTANT: not to have duplicate motifs within same id
      setorder(DT)    ## IMPORTANT: motifs are ordered within id as well
      setkey(DT, id)  ## reset key to 'id'. Motifs ordered within id from previous step
      DT[, runlen := .I]
      
    • Step 3: Solution:

      ans = DT[DT, {
                    tmp = runlen < i.runlen; 
                    list(motif[tmp], i.motif[any(tmp)])
                   }, 
            by=.EACHI][, .N, by="V1,V2"]
      

      This takes ~27 seconds and ~1GB of memory during the final step 3.

    The idea is to perform a self-join using data.table's by=.EACHI feature, which evaluates the j-expression once for each row of i and is therefore memory efficient, since the full joined result is never materialised. The j-expression ensures that we only obtain the entry "motif_a, motif_b" and not the redundant "motif_b, motif_a", which saves both computation time and memory. The binary search is also quite fast, even though there are roughly 83.5K ids. Finally, we aggregate by the motif combinations to get the number of rows in each, which is what you require.
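
If you want to sanity-check the pair counts on a small input, here is a minimal sketch. It is not the self-join above, just a slower cross-check that enumerates the unordered pairs within each id using base R's combn. The toy ids and motif names are made up for illustration, and the column names V1 and V2 are chosen only to mirror the output of step 3.

```r
library(data.table)

# Hypothetical toy data: three ids, with one duplicated (id, motif) row
DT <- data.table(
  id    = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
  motif = c("a", "b", "c", "a", "a", "b", "c")
)

DT <- unique(DT)          # drop duplicate motifs within the same id
setorder(DT, id, motif)   # sort so each pair comes out as (smaller, larger)

# For every id with at least two motifs, list each unordered pair once.
# ids with a single motif return NULL and contribute no rows.
pairs <- DT[, if (.N > 1L) {
  m <- combn(motif, 2L)   # 2 x choose(.N, 2) matrix, one pair per column
  list(V1 = m[1L, ], V2 = m[2L, ])
}, by = id]

# Count in how many ids each pair occurs
counts <- pairs[, .N, by = .(V1, V2)]
setkey(counts, V1, V2)
print(counts)
# (a, b) occurs in ids 1 and 2, so N = 2; (a, c) and (b, c) occur only in id 1
```

This allocates a matrix per id, so it is only meant for checking small cases; for the full 2.7M-row data the by=.EACHI version is the one to use.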

    HTH

    PS: See revision for the older (+ slower) version.
