I have a data frame with three columns: custId, saleDate, and DelivDate.
> head(events22)
     custId   saleDate DelivDate
1 280356593  2012-11-1
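(If you want to run the code below without the original data, here's a made-up 20-row stand-in for events22. The values are invented, and the real DelivDate may well be a date-time rather than a Date; none of the comparisons below depend on that.)

# Hypothetical stand-in for the original 20-row events22
set.seed(42)
events22 <- data.frame(
    custId    = sample(3e8, 20),
    saleDate  = as.Date("2012-11-01") + sample(0:60, 20, replace = TRUE),
    DelivDate = as.Date("2013-01-01") + sample(0:60, 20, replace = TRUE)
)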
Here's a much faster data.table function:
DATATABLE <- function() {
    # Keying sorts by custId, then saleDate, so within each custId
    # group the last row holds that customer's latest sale
    dt <- data.table(events, key = c('custId', 'saleDate'))
    dt[, maxrow := 1:.N == .N, by = custId]
    return(dt[maxrow == TRUE, list(custId, DelivDate)])
}
Note that this function creates the data.table and sorts it, a step you only need to perform once. If you remove that step (say you have a multi-step data-processing pipeline and create the data.table once, up front), the function is more than twice as fast; see /Edit2 below. A tiny illustration of the 1:.N == .N trick follows.
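On made-up data, the row where 1:.N == .N is simply the last row of each sorted custId group, i.e. the row with the max saleDate:

toy <- data.table(custId = c(1, 1, 2), saleDate = c(5, 9, 3),
                  key = c('custId', 'saleDate'))
toy[, maxrow := 1:.N == .N, by = custId]
toy
#    custId saleDate maxrow
# 1:      1        5  FALSE
# 2:      1        9   TRUE
# 3:      2        3   TRUE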
I also modified all the previous functions to return the result, for easier comparison:
DDPLY <- function() {
    return(ddply(events, .(custId), .inform = TRUE,
                 function(x) {
                     x[x$saleDate == max(x$saleDate), "DelivDate"]
                 }))
}
AGG1 <- function() {
    return(merge(events, aggregate(saleDate ~ custId, events, max)))
}
SQLDF <- function() {
    # SQLite pulls the bare columns (custId, DelivDate) from the row
    # where max(saleDate) is found, so DelivDate matches the latest sale
    return(sqldf("select custId, DelivDate, max(saleDate) `saleDate`
                  from events group by custId"))
}
DOCALL <- function() {
    return(do.call(rbind,
                   lapply(split(events, events$custId), function(x) {
                       x[which.max(x$saleDate), ]
                   })))
}
Here are the results for 10k rows, with each function repeated 10 times:
library(rbenchmark)
library(plyr)
library(data.table)
library(sqldf)

# Stack 500 copies of the 20-row events22 to get 10,000 rows,
# then assign each row a random (unique) custId
events <- do.call(rbind, lapply(1:500, function(x) events22))
events$custId <- sample(1:nrow(events), nrow(events))
benchmark(a <- DDPLY(), b <- DATATABLE(), c <- AGG1(), d <- SQLDF(),
e <- DOCALL(), order = "elapsed", replications=10)[1:5]
              test replications elapsed relative user.self
2 b <- DATATABLE()           10    0.13    1.000      0.13
4     d <- SQLDF()           10    0.42    3.231      0.41
3      c <- AGG1()           10   12.11   93.154     12.03
1     a <- DDPLY()           10   32.17  247.462     32.01
5    e <- DOCALL()           10   56.05  431.154     55.85
Since all the functions return their results, we can verify they all return the same answer:
# AGG1's merge returns rows in a different order, so sort it first
c <- c[order(c$custId),]
dim(a); dim(b); dim(c); dim(d); dim(e)

# ddply returns the unnamed result column as V1
all(a$V1 == b$DelivDate)
all(a$V1 == c$DelivDate)
all(a$V1 == d$DelivDate)
all(a$V1 == e$DelivDate)
/Edit: On the smaller, 20-row dataset, data.table is still the fastest, but by a thinner margin:
              test replications elapsed relative user.self
2 b <- DATATABLE()          100    0.22    1.000      0.22
3      c <- AGG1()          100    0.42    1.909      0.42
5    e <- DOCALL()          100    0.48    2.182      0.49
1     a <- DDPLY()          100    0.55    2.500      0.55
4     d <- SQLDF()          100    1.00    4.545      0.98
/Edit2: If we remove the data.table creation from the function, we get the following results:
# Create (and key) the data.table once, outside the timed function
dt <- data.table(events, key = c('custId', 'saleDate'))

DATATABLE2 <- function() {
    dt[, maxrow := 1:.N == .N, by = custId]
    return(dt[maxrow == TRUE, list(custId, DelivDate)])
}
benchmark(a <- DDPLY(), b <- DATATABLE2(), c <- AGG1(), d <- SQLDF(),
e <- DOCALL(), order = "elapsed", replications=10)[1:5]
               test replications elapsed relative user.self
2 b <- DATATABLE2()           10    0.09    1.000      0.08
4      d <- SQLDF()           10    0.41    4.556      0.39
3       c <- AGG1()           10   11.73  130.333     11.67
1      a <- DDPLY()           10   31.59  351.000     31.50
5     e <- DOCALL()           10   55.05  611.667     54.91
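As a final aside (not benchmarked above), data.table can also grab the max row per group directly, without the maxrow flag column; a sketch of an equivalent one-liner:

# Equivalent data.table idiom: take the row with the largest saleDate
# per custId, then keep the two columns of interest
dt[, .SD[which.max(saleDate)], by = custId][, list(custId, DelivDate)]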