R: speeding up “group by” operations

挽巷 asked 2020-11-28 19:30

I have a simulation that has a huge aggregate and combine step right in the middle. I prototyped this process using plyr's ddply() function, which works great for a huge percentage of my needs.
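
For reference, here is a minimal sketch of the kind of data frame the answers below assume. The name myDF and the column names come from the answer code; the sizes and value ranges are made up for illustration:

    # Hypothetical myDF for illustration only; sizes are arbitrary.
    set.seed(1)
    n <- 1e6
    myDF <- data.frame(
      year    = sample(2000:2010, n, replace = TRUE),
      state   = sample(state.abb, n, replace = TRUE),
      group1  = sample(letters[1:5], n, replace = TRUE),
      group2  = sample(letters[1:5], n, replace = TRUE),
      myFact  = rnorm(n),
      weights = runif(n)
    )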

5 Answers
  •  无人及你 answered 2020-11-28 20:09

    Instead of the normal R data frame, you can use an immutable data frame (plyr's idata.frame()), which returns pointers to the original when you subset it and can be much faster:

    library(plyr)   # idata.frame(), ddply()
    library(Hmisc)  # wtd.mean()
    # Subsetting an immutable data frame returns pointers, not copies.
    idf <- idata.frame(myDF)
    system.time(aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),
       function(df) wtd.mean(df$myFact, weights=df$weights)))
    
    #    user  system elapsed 
    # 18.032   0.416  19.250 
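
    For comparison, the question's baseline is the same ddply() call on the ordinary data frame. A sketch (no timings shown; they vary by machine):

    # Baseline from the question's prototype: each per-group subset
    # copies its rows out of the data frame, which is the slow part.
    system.time(aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),
       function(df) wtd.mean(df$myFact, weights=df$weights)))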
    

    If I were to write a plyr function customised exactly to this situation, I'd do something like this:

    system.time({
      # Compute a single integer id for each unique combination of the
      # grouping variables.
      ids <- id(myDF[c("year", "state", "group1", "group2")], drop = TRUE)
      # Work on a numeric matrix: matrix subsetting is much faster than
      # data frame subsetting.
      data <- as.matrix(myDF[c("myFact", "weights")])
      # For each group, the row indices that belong to it.
      indices <- plyr:::split_indices(seq_len(nrow(data)), ids, n = attr(ids, "n"))
    
      fun <- function(rows) {
        weighted.mean(data[rows, 1], data[rows, 2])
      }
      values <- vapply(indices, fun, numeric(1))
    
      # Recover the grouping labels from the first row of each group.
      labels <- myDF[match(seq_len(attr(ids, "n")), ids), 
        c("year", "state", "group1", "group2")]
      aggregateDF <- cbind(labels, values)
    })
    
    #    user  system elapsed 
    #    2.04    0.29    2.33 
    

    It's so much faster because it avoids copying the data: it extracts only the subset needed for each computation, at the moment it's computed. Switching the data to matrix form gives another speed boost, because matrix subsetting is much faster than data frame subsetting.
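
    To see the matrix-subsetting effect in isolation, here is a rough sketch; the sizes and repetition count are arbitrary:

    # Repeated row subsetting is far cheaper on a matrix than on a
    # data frame holding the same values.
    m    <- matrix(runif(2e6), ncol = 2)
    df   <- as.data.frame(m)
    rows <- sample(nrow(m), 100)
    system.time(for (i in 1:1e4) m[rows, ])   # matrix subsetting
    system.time(for (i in 1:1e4) df[rows, ])  # data frame subsetting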
