R use ddply or aggregate

渐次进展 2020-12-01 21:58

I have a data frame with 3 columns: custId, saleDate, DelivDate.

> head(events22)
     custId            saleDate      DelivDate
1 280356593 2012-11-1         


        
4 answers
  •  醉酒成梦
    2020-12-01 22:26

    Here's a much faster data.table function:

    DATATABLE <- function() { 
      # Keying sorts by custId, then saleDate
      dt <- data.table(events, key=c('custId', 'saleDate'))
      # Flag the last (latest) row within each custId group
      dt[, maxrow := 1:.N==.N, by = custId]
      return(dt[maxrow==TRUE, list(custId, DelivDate)])
    }
    

    Note that this function creates a data.table and sorts the data, which is a step you'd only need to perform once. If you remove this step (perhaps you have a multi-step data processing pipeline, and create the data.table once, as a first step), the function is more than twice as fast.
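
    To make the keyed approach concrete, here is a minimal sketch on a made-up five-row data set. The `toy` data frame and its values are assumptions for illustration and are not taken from the question's data. It shows the intermediate `maxrow` flag before the final filter.

```r
library(data.table)

# Hypothetical toy data; the values are invented for illustration.
toy <- data.frame(
  custId    = c(1, 1, 2, 2, 2),
  saleDate  = as.Date(c("2012-11-14", "2012-11-16",
                        "2012-11-15", "2012-11-13", "2012-11-14")),
  DelivDate = c("d1", "d2", "d3", "d4", "d5"),
  stringsAsFactors = FALSE
)

# Keying sorts by custId, then saleDate, so the last row of each
# custId group is that customer's most recent sale.
dt <- data.table(toy, key = c("custId", "saleDate"))

# .N is the number of rows in the current group, so 1:.N == .N is
# TRUE only for the last row of each group.
dt[, maxrow := 1:.N == .N, by = custId]
print(dt)

# Keep only the flagged rows.
latest <- dt[maxrow == TRUE, list(custId, DelivDate)]
print(latest)
```

    Because the sort happens when the key is set, the grouping step itself never has to compare dates, which is presumably why the function gets cheaper once the keyed table is built ahead of time.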

    I also modified all the previous functions to return the result, for easier comparison:

    DDPLY <- function() { 
      # For each custId, keep DelivDate from the row(s) with the latest saleDate
      return(ddply(events, .(custId), .inform = TRUE, 
                   function(x) {
                     x[x$saleDate == max(x$saleDate),"DelivDate"]}))
    }
    AGG1 <- function() { 
      return(merge(events, aggregate(saleDate ~ custId, events, max)))}
    
    SQLDF <- function() { 
      return(sqldf("select custId, DelivDate, max(saleDate) `saleDate` 
                 from events group by custId"))}
    DOCALL <- function() {
      return(do.call(rbind, 
                     lapply(split(events, events$custId), function(x){
                       x[which.max(x$saleDate), ]
                     })
      ))
    }
    

    Here are the results for 10k rows (the 20-row sample stacked 500 times, with custId reshuffled), with each function repeated 10 times:

    library(rbenchmark)
    library(plyr)
    library(data.table)
    library(sqldf)
    events <- do.call(rbind, lapply(1:500, function(x) events22))
    events$custId <- sample(1:nrow(events), nrow(events))
    
    benchmark(a <- DDPLY(), b <- DATATABLE(), c <- AGG1(), d <- SQLDF(),
     e <- DOCALL(), order = "elapsed", replications=10)[1:5]
    
                  test replications elapsed relative user.self
    2 b <- DATATABLE()           10    0.13    1.000      0.13
    4     d <- SQLDF()           10    0.42    3.231      0.41
    3      c <- AGG1()           10   12.11   93.154     12.03
    1     a <- DDPLY()           10   32.17  247.462     32.01
    5    e <- DOCALL()           10   56.05  431.154     55.85
    

    Since all the functions return their results, we can verify they all return the same answer:

    c <- c[order(c$custId),]
    dim(a); dim(b); dim(c); dim(d); dim(e)
    all(a$V1==b$DelivDate)
    all(a$V1==c$DelivDate)
    all(a$V1==d$DelivDate)
    all(a$V1==e$DelivDate)
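
    The element-wise checks above rely on every result already being in the same row order. A slightly stricter comparison can be sketched in base R as follows. The `toy` data and the `normalise` helper are hypothetical names introduced here for illustration; the two approaches compared mirror AGG1 and DOCALL.

```r
# Hypothetical toy data; the values are invented for illustration.
toy <- data.frame(
  custId    = c(1, 1, 2, 2, 2),
  saleDate  = as.Date(c("2012-11-14", "2012-11-16",
                        "2012-11-15", "2012-11-13", "2012-11-14")),
  DelivDate = c("d1", "d2", "d3", "d4", "d5"),
  stringsAsFactors = FALSE
)

# Approach 1 (as in AGG1): max saleDate per customer, merged back.
by_merge <- merge(toy, aggregate(saleDate ~ custId, toy, max))

# Approach 2 (as in DOCALL): split by customer, keep the latest row.
by_split <- do.call(rbind, lapply(split(toy, toy$custId), function(x) {
  x[which.max(x$saleDate), ]
}))

# Hypothetical helper: put rows and columns in a fixed order and drop
# row names so only the data itself is compared.
normalise <- function(df) {
  df <- df[order(df$custId), c("custId", "saleDate", "DelivDate")]
  rownames(df) <- NULL
  df
}

same <- isTRUE(all.equal(normalise(by_merge), normalise(by_split)))
print(same)
```

    One caveat: if a customer has two sales with the same latest saleDate, the `==`-based and merge-based approaches return every tied row, while `which.max` returns only the first, so the row counts can differ on data with ties.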
    

    /Edit: On the smaller 20-row dataset (100 replications), data.table is still the fastest, but by a narrower margin:

                  test replications elapsed relative user.self
    2 b <- DATATABLE()          100    0.22    1.000      0.22
    3      c <- AGG1()          100    0.42    1.909      0.42
    5    e <- DOCALL()          100    0.48    2.182      0.49
    1     a <- DDPLY()          100    0.55    2.500      0.55
    4     d <- SQLDF()          100    1.00    4.545      0.98
    

    /Edit2: If we remove the data.table creation from the function we get the following results:

    dt <- data.table(events, key=c('custId', 'saleDate'))
    DATATABLE2 <- function() { 
      dt[, maxrow := 1:.N==.N, by = custId]
      return(dt[maxrow==TRUE, list(custId, DelivDate)])
    }
    benchmark(a <- DDPLY(), b <- DATATABLE2(), c <- AGG1(), d <- SQLDF(),
               e <- DOCALL(), order = "elapsed", replications=10)[1:5]
                       test replications elapsed relative user.self
    2 b <- DATATABLE2()           10    0.09    1.000      0.08
    4      d <- SQLDF()           10    0.41    4.556      0.39
    3       c <- AGG1()           10   11.73  130.333     11.67
    1      a <- DDPLY()           10   31.59  351.000     31.50
    5     e <- DOCALL()           10   55.05  611.667     54.91
    
