R use ddply or aggregate

渐次进展 2020-12-01 21:58

I have a data frame with 3 columns: custId, saleDate, DelivDateTime.

> head(events22)
     custId            saleDate      DelivDate
1 280356593 2012-11-1         


        
4 answers
  •  栀梦 (original poster)
    2020-12-01 22:12

    I, too, would recommend data.table here, but since you asked for an aggregate solution, here is one which combines aggregate and merge to get all the columns:

    merge(events22, aggregate(saleDate ~ custId, events22, max))
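
    To make the two steps visible, here is a minimal sketch on a made-up data frame. The column names follow the question, but the values are hypothetical and assume both date columns are stored as Date:

```r
# Hypothetical toy data; column names follow the question, values are made up.
events22 <- data.frame(
  custId    = c(1, 1, 2, 2, 2),
  saleDate  = as.Date(c("2012-11-14", "2012-11-20", "2012-10-01",
                        "2012-12-05", "2012-11-30")),
  DelivDate = as.Date(c("2012-11-16", "2012-11-22", "2012-10-03",
                        "2012-12-07", "2012-12-02"))
)

# Step 1: latest saleDate per customer (two columns: custId, saleDate).
latest <- aggregate(saleDate ~ custId, events22, max)

# Step 2: merge joins on the shared columns custId and saleDate,
# which keeps only the rows holding each customer's latest sale.
result <- merge(events22, latest)
print(result)
```

    Note that if a customer has several rows sharing the same latest saleDate, the merge keeps all of them.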
    

    Or just aggregate if you only want the "custId" and "DelivDate" columns:

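    # Aggregate over row numbers so that which.max() picks the right row of
    # events22 (a within-group position would index the wrong rows).
    aggregate(list(DelivDate = seq_len(nrow(events22))), 
              list(custId = events22$custId),
              function(i) events22$DelivDate[i[which.max(events22$saleDate[i])]])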
    

    Finally, here's an option using sqldf:

    library(sqldf)
    sqldf("select custId, DelivDate, max(saleDate) `saleDate` 
          from events22 group by custId")
    

    Benchmarks

    I'm not a benchmarking or data.table expert, but it surprised me that data.table is not faster here. I suspect the results would be quite different on a larger dataset, such as your 400k-row one. Anyway, here's some benchmarking code modeled after @mnel's answer, so you can run the same tests on your actual dataset.

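    library(rbenchmark)
    library(plyr)        # ddply
    library(data.table)
    library(sqldf)
    dt <- as.data.table(events22)  # assumed: dt is a data.table copy of events22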
    

    First, set up your functions for what you want to benchmark.

    DDPLY <- function() { 
      x <- ddply(events22, .(custId), .inform = TRUE, 
                 function(x) {
                   x[x$saleDate == max(x$saleDate), "DelivDate"]}) 
    }
    DATATABLE <- function() { x <- dt[, .SD[which.max(saleDate), ], by = custId] }
    AGG1 <- function() { 
      x <- merge(events22, aggregate(saleDate ~ custId, events22, max)) }
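    AGG2 <- function() { 
      # Aggregate over row numbers so which.max() indexes events22 correctly;
      # the timings below were recorded with a version that indexed DelivDate
      # by the within-group position.
      x <- aggregate(list(DelivDate = seq_len(nrow(events22))), 
                     list(custId = events22$custId),
                     function(i) events22$DelivDate[i[which.max(events22$saleDate[i])]]) }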
    SQLDF <- function() { 
      x <- sqldf("select custId, DelivDate, max(saleDate) `saleDate` 
                 from events22 group by custId") }
    DOCALL <- function() {
      do.call(rbind, 
              lapply(split(events22, events22$custId), function(x){
                x[which.max(x$saleDate), ]
              })
      )
    }
    

    Second, do the benchmarking.

    benchmark(DDPLY(), DATATABLE(), AGG1(), AGG2(), SQLDF(), DOCALL(), 
              order = "elapsed")[1:5]
    #          test replications elapsed relative user.self
    # 4      AGG2()          100   0.285    1.000     0.284
    # 3      AGG1()          100   0.891    3.126     0.896
    # 6    DOCALL()          100   1.202    4.218     1.204
    # 2 DATATABLE()          100   1.251    4.389     1.248
    # 1     DDPLY()          100   1.254    4.400     1.252
    # 5     SQLDF()          100   2.109    7.400     2.108
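
    Since the rankings may change with data size, here is a rough sketch for generating a larger synthetic data frame and timing two of the base R approaches with system.time. The row count, number of customers, and date ranges are all assumptions for illustration, not properties of the real dataset:

```r
# Hypothetical synthetic data: sizes, ID range, and dates are assumptions.
set.seed(1)
n <- 1e5        # raise toward 400000 to mimic the size mentioned above
n_cust <- 2000  # assumed number of distinct customers
big <- data.frame(
  custId   = sample(n_cust, n, replace = TRUE),
  saleDate = as.Date("2012-01-01") + sample(0:364, n, replace = TRUE)
)
big$DelivDate <- big$saleDate + sample(1:5, n, replace = TRUE)

# Approach 1: aggregate + merge (keeps every row tied for the latest saleDate).
t_merge <- system.time(
  res_merge <- merge(big, aggregate(saleDate ~ custId, big, max))
)

# Approach 2: split + rbind (keeps exactly one row per customer).
t_split <- system.time(
  res_split <- do.call(rbind, lapply(split(big, big$custId),
                                     function(x) x[which.max(x$saleDate), ]))
)

print(rbind(merge = t_merge, split = t_split)[, "elapsed"])
```

    The same pattern extends to the other functions in the benchmark; swapping big in for events22 lets benchmark() run them on the larger data.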
    
