Piggybacking on @Anna's answer, I ran a few of the options to see which was faster on a larger dataset, for a problem I have at work. I used the setup from here (Faster way to subset on rows of a data frame in R?) and checked it on a 1-billion-row (16 GB) dataset. data.table edged out dplyr by a little, though I'm just starting to use data.table, so I may not have used the most efficient code. Also, I narrowed it down to these four approaches based on timings from a 100-million-row dataset. See below:
library(dplyr)
library(data.table)
library(microbenchmark)

set.seed(42)
# 1 billion rows (~16 GB)
df <- data.frame(age = sample(1:65, 1e9, replace = TRUE), x = rnorm(1e9), y = rpois(1e9, 25))

microbenchmark(df1 <- df %>% filter(age >= 5 & age <= 25),
               df2 <- df %>% filter(dplyr::between(df$age, 5, 25)),
               times = 10)
Unit: seconds
                                         expr    min     lq   mean median     uq    max neval
          df %>% filter(age >= 5 & age <= 25) 15.327 15.796 16.526 16.601 17.086 17.996    10
 df %>% filter(dplyr::between(df$age, 5, 25)) 14.214 14.752 15.413 15.487 16.121 16.447    10
DT <- as.data.table(df)

microbenchmark(dt1 <- DT[age %inrange% c(5, 25)],
               dt2 <- DT[age %between% c(5, 25)],
               times = 10)
Unit: seconds
                              expr    min     lq   mean median     uq    max neval
 dt1 <- DT[age %inrange% c(5, 25)] 15.122 16.042 17.180 16.969 17.310 22.138    10
 dt2 <- DT[age %between% c(5, 25)] 10.212 11.121 11.675 11.436 12.132 13.913    10
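Since I said I may not have the most efficient data.table code: one thing I haven't timed here is keying the table on age first, which should let data.table binary-search the key rather than scan the whole column. Just a sketch of the idea, not a measured result:

# untimed sketch: key on age so the subset can use binary search
setkey(DT, age)
dt3 <- DT[.(5:25)]  # keyed join on the integer ages 5..25 (rows come back in key order)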