Reading Hadley's and Arun's answers, one gets the impression that those who prefer dplyr's syntax would in some cases either have to switch over to data.table or put up with long running times.
But, as some have already mentioned, dplyr can use data.table as a backend. This is accomplished with the dtplyr package, which recently had its 1.0.0 release. Learning dtplyr incurs practically zero additional effort.
When using dtplyr, one calls lazy_dt() to declare a lazy data.table, after which standard dplyr syntax is used to specify operations on it. This would look something like the following (mtcars2 below stands in for a plain data.frame copy of the built-in mtcars dataset):
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

mtcars2 <- mtcars  # assumed here: a plain data.frame copy of the built-in mtcars

new_table <- mtcars2 %>%
  lazy_dt() %>%                    # declare a lazy data.table
  filter(wt < 5) %>%
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>%
  summarise(l100k = mean(l100k))
new_table
#> Source: local data table [?? x 2]
#> Call:   `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)),
#>     keyby = .(cyl)]
#>
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0
#> 3     8 14.9
#>
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
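
If you only want to see the data.table translation dtplyr generates (the expression shown in the Call: line above) without the preview, dplyr's show_query() generic works on lazy data.tables as well:

new_table %>% show_query()  # prints the generated data.table call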
The new_table object is not evaluated until as.data.table(), as.data.frame(), or as_tibble() is called on it, at which point the underlying data.table operations are executed.
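
As a minimal illustration, continuing from the new_table object above, any of the following forces dtplyr to run the generated data.table expression:

result_dt  <- as.data.table(new_table)  # returns a data.table
result_df  <- as.data.frame(new_table)  # returns a plain data.frame
result_tbl <- as_tibble(new_table)      # returns a tibble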
I've recreated a benchmark analysis done by data.table author Matt Dowle back in December 2018, covering the case of operations over large numbers of groups. I found that dtplyr does indeed, for the most part, enable those who prefer the dplyr syntax to keep using it while enjoying the speed offered by data.table.
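
For a rough idea of what such a comparison looks like, below is a minimal sketch of a grouped-aggregation benchmark. The row and group counts, the column names, and the use of bench::mark are my own illustrative choices here, not the setup of the original benchmark:

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(bench)

set.seed(1)
n_rows   <- 1e6  # illustrative sizes only
n_groups <- 1e5

df <- data.frame(
  id  = sample(n_groups, n_rows, replace = TRUE),
  val = rnorm(n_rows)
)
dt <- as.data.table(df)

bench::mark(
  dplyr      = df %>% group_by(id) %>% summarise(m = mean(val)),
  dtplyr     = dt %>% lazy_dt() %>% group_by(id) %>%
                 summarise(m = mean(val)) %>% as_tibble(),
  data.table = dt[, .(m = mean(val)), by = id],
  check = FALSE  # results differ in class and row order across backends
)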