R: speeding up “group by” operations

挽巷 2020-11-28 19:30

I have a simulation that has a huge aggregate-and-combine step right in the middle. I prototyped this process using plyr's ddply() function, which works great for a huge percentage of my needs, but I need this aggregation step to be faster since I have to run many simulations.

5 Answers
    野趣味 2020-11-28 19:49

    Are you using the latest version of plyr (note: this hasn't made it to all the CRAN mirrors yet)? If so, you could just run this in parallel.

    Here's the llply example, but the same should apply to ddply:

      library(plyr)

      # A toy task: 20 calls that each sleep 0.1s, so ~2s when run serially
      x <- seq_len(20)
      wait <- function(i) Sys.sleep(0.1)
      system.time(llply(x, wait))
      #  user  system elapsed 
      # 0.007   0.005   2.005 

      # Register a parallel backend with two workers; llply then
      # splits the calls across them
      library(doMC)
      registerDoMC(2)
      system.time(llply(x, wait, .parallel = TRUE))
      #  user  system elapsed 
      # 0.020   0.011   1.038 
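
    And here's a sketch of the same idea with ddply(), reusing the column
    names from the benchmark code below (wtd.mean() comes from the Hmisc
    package); treat it as an illustration rather than tested code:

      library(plyr)
      library(doMC)
      library(Hmisc)  # provides wtd.mean()
      registerDoMC(2)

      # Weighted mean of myFact within each (year, state, group1, group2)
      # cell, with the groups processed in parallel
      aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),
                           function(df) wtd.mean(df$myFact, weights = df$weights),
                           .parallel = TRUE)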
    

    Edit:

    Well, other looping approaches are worse, so this probably requires either (a) C/C++ code or (b) a more fundamental rethinking of how you're doing it. I didn't even try using by() because that's very slow in my experience.
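
    For reference, here's a minimal synthetic myDF that makes the snippets
    below runnable; the sizes and value ranges are assumptions, and
    wtd.mean() comes from the Hmisc package:

      library(Hmisc)  # provides wtd.mean()

      # Hypothetical data in the shape the benchmarks below assume
      set.seed(42)
      n <- 100000
      myDF <- data.frame(
        year    = sample(2000:2009, n, replace = TRUE),
        state   = sample(state.abb, n, replace = TRUE),
        group1  = sample(letters[1:5], n, replace = TRUE),
        group2  = sample(letters[1:5], n, replace = TRUE),
        myFact  = rnorm(n),
        weights = runif(n)
      )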

      # The distinct grouping combinations
      groups <- unique(myDF[, c("year", "state", "group1", "group2")])

      # Variant 1: lapply over the groups, then rbind once at the end
      system.time(
        aggregateDF <- do.call("rbind", lapply(1:nrow(groups), function(i) {
          df.tmp <- myDF[myDF$year   == groups[i, "year"] &
                         myDF$state  == groups[i, "state"] &
                         myDF$group1 == groups[i, "group1"] &
                         myDF$group2 == groups[i, "group2"], ]
          cbind(groups[i, ], wtd.mean(df.tmp$myFact, weights = df.tmp$weights))
        }))
      )

      # Variant 2: grow the result row by row inside a for loop
      # (repeated rbind() recopies the accumulated result every iteration)
      aggregateDF <- data.frame()
      system.time(
        for (i in 1:nrow(groups)) {
          df.tmp <- myDF[myDF$year   == groups[i, "year"] &
                         myDF$state  == groups[i, "state"] &
                         myDF$group1 == groups[i, "group1"] &
                         myDF$group2 == groups[i, "group2"], ]
          aggregateDF <- rbind(aggregateDF,
                               data.frame(cbind(groups[i, ],
                                                wtd.mean(df.tmp$myFact,
                                                         weights = df.tmp$weights))))
        }
      )
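
    As one concrete instance of option (b), here's a sketch using the
    data.table package (my suggestion, not part of the original answer):
    a single keyed grouped pass computes the same weighted means without
    any per-group subsetting. Base R's weighted.mean() gives the same
    result here, so Hmisc isn't needed:

      library(data.table)

      dt <- as.data.table(myDF)

      # One vectorized grouped aggregation instead of nrow(groups) subsets
      system.time(
        aggregateDT <- dt[, .(wmean = weighted.mean(myFact, weights)),
                          by = .(year, state, group1, group2)]
      )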
    
