Count every possible pair of values in a column grouped by multiple columns

不知归路 2020-12-03 15:59

I have a data frame that looks like this (this is just a subset; the actual dataset has 2,724,098 rows):

> head(dat)

chr   start  end    enhancer motif 
chr10          


        
7 answers
  • 2020-12-03 16:54

    ...if this isn't what you want, I'm giving up. Obviously this isn't optimized for a large data set; it's just a general algorithm that plays to R's strengths. Several improvements are possible, e.g. with dplyr or data.table. The latter will greatly speed up the [ and %in% operations here.

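    library(plyr)  # provides daply() and .()

    # every unordered pair of distinct motifs, one pair per column
    motif_pairs <- combn(unique(as.character(dat$motif)), 2)
    colnames(motif_pairs) <- apply(motif_pairs, 2, paste, collapse = " ")
    # for each pair, count the ids whose rows contain both motifs
    motif_pair_counts <- apply(motif_pairs, 2, function(motif_pair) {
      sum(daply(dat[dat$motif %in% motif_pair, ], .(id), function(dat_subset){
        all(motif_pair %in% dat_subset$motif)
      }))
    })
    # build the result column by column so that count stays numeric
    # (cbind() with the character matrix would coerce it to text)
    motif_pair_counts <- data.frame(motif1 = motif_pairs[1, ], motif2 = motif_pairs[2, ],
                                    count = unname(motif_pair_counts), row.names = NULL)
    motif_pair_counts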
    
    #   motif1 motif2 count
    # 1  GATA6  GATA4     3
    # 2  GATA6    SRF     2
    # 3  GATA6  MEF2A     2
    # 4  GATA4    SRF     2
    # 5  GATA4  MEF2A     2
    # 6    SRF  MEF2A     3
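
    Since the loop above visits every motif pair and re-subsets the data each time, it may be worth sketching one vectorized alternative. This is not the dplyr or data.table route mentioned earlier but a swapped-in base R technique: build an id-by-motif incidence matrix and take its cross-product. The toy data frame and the names incidence, co and pair_counts are assumptions made for illustration only.

    ```r
    # Toy data (hypothetical; stands in for the real `dat` with `id` and `motif` columns)
    dat <- data.frame(
      id    = c("e1", "e1", "e1", "e2", "e2", "e3", "e3", "e3"),
      motif = c("GATA4", "GATA6", "GATA4", "GATA4", "SRF", "GATA6", "SRF", "GATA4"),
      stringsAsFactors = FALSE
    )

    # Incidence matrix: 1 where a motif occurs at least once within an id, else 0
    incidence <- 1 * (unclass(table(dat$id, dat$motif)) > 0)

    # crossprod(X) is t(X) %*% X: entry [i, j] is the number of ids
    # that contain both motif i and motif j
    co <- crossprod(incidence)

    # Keep each unordered pair once (upper triangle, diagonal excluded)
    idx <- which(upper.tri(co), arr.ind = TRUE)
    pair_counts <- data.frame(
      motif1 = rownames(co)[idx[, "row"]],
      motif2 = colnames(co)[idx[, "col"]],
      count  = co[idx],
      stringsAsFactors = FALSE
    )
    pair_counts
    ```

    Note that the incidence matrix here is dense, so with very many distinct ids it may not fit in memory; treat this as a sketch rather than a tuned solution.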
    

    Another old version. PLEASE make sure your question is clear!

    This is precisely what plyr was designed to accomplish. Try dlply(dat, .(id), function(x) table(x$motif) ).

    But please don't just try to copy and paste this solution without at least reading the documentation. plyr is a very powerful package and it will be very helpful for you to understand it.
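
    To see what shape of result that one-liner gives, here is a roughly equivalent base R sketch that needs no extra package. It is a stand-in, not plyr itself: split() and lapply() play the role of dlply(), and the toy data frame and the name motif_tables are made up for illustration.

    ```r
    # Toy data (hypothetical; stands in for the real `dat`)
    dat <- data.frame(
      id    = c("e1", "e1", "e1", "e2", "e2"),
      motif = c("GATA4", "GATA6", "GATA4", "SRF", "GATA4"),
      stringsAsFactors = FALSE
    )

    # One frequency table of motifs per id, returned as a list named by id
    motif_tables <- lapply(split(dat$motif, dat$id), table)
    motif_tables
    ```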


    Old post answering the wrong question:

    Are you looking for disjoint or overlapping pairs?

    Here's one solution using the function rollapply from package zoo:

    library(zoo)
    
    motif_pairs <- rollapply(dat$motif, 2, c)              # get a matrix of pairs
    motif_pairs <- apply(motif_pairs, 1, function(row) {   # for every row...
      paste0(sort(row), collapse = " ")                    #   sort the row, and concatenate it to a single string
                                                           #   (sorting ensures that pairs are not double-counted)
    })
    table(motif_pairs)                                     # since each pair is now represented by a unique string, just tabulate the string appearances
    
    ## if you want disjoint pairs, do `rollapply(dat$motif, 2, c, by = 2)` instead
    

    Take a look at the docs for rollapply if this isn't quite what you need. For grouping by other variables, you can combine this with one of:

    • base R functions aggregate or by (not recommended), or
    • the *ply functions from plyr (better)
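
    As a rough illustration of the grouping step, the sketch below counts overlapping consecutive pairs within each id, so that no pair spans two groups. To stay dependency-free it swaps rollapply for head()/tail() and uses split() with lapply() instead of plyr; the toy data and the helper name pairs_within are hypothetical.

    ```r
    # Toy data (hypothetical; stands in for the real `dat`)
    dat <- data.frame(
      id    = c("e1", "e1", "e1", "e2", "e2", "e3"),
      motif = c("GATA4", "GATA6", "GATA4", "SRF", "GATA4", "MEF2A"),
      stringsAsFactors = FALSE
    )

    # Overlapping consecutive pairs within one group, each pair put in a fixed
    # order so that "A B" and "B A" are counted as the same pair
    pairs_within <- function(motifs) {
      if (length(motifs) < 2) return(character(0))
      first  <- head(motifs, -1)
      second <- tail(motifs, -1)
      lo <- ifelse(first <= second, first, second)
      hi <- ifelse(first <= second, second, first)
      paste(lo, hi)
    }

    # Apply per id, then tabulate all pair strings together
    pair_strings <- unlist(lapply(split(dat$motif, dat$id), pairs_within), use.names = FALSE)
    pair_table <- table(pair_strings)
    pair_table
    ```

    Without the grouping, the last row of e1 and the first row of e2 would also be paired, which is exactly the kind of cross-group pair the split avoids.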