data.table vs dplyr: can one do something well the other can't or does poorly?

Asked by 迷失自我, 2020-11-22 08:53

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that hav…

4 Answers
  •  北恋 (OP)
     2020-11-22 09:19

    Reading Hadley and Arun's answers, one gets the impression that those who prefer dplyr's syntax would in some cases have to switch over to data.table or accept long running times.

    But as some have already mentioned, dplyr can use data.table as a backend. This is accomplished using the dtplyr package, which recently had its 1.0.0 release. Learning dtplyr incurs practically zero additional effort.

    When using dtplyr one uses the function lazy_dt() to declare a lazy data.table, after which standard dplyr syntax is used to specify operations on it. This would look something like the following:

    library(dtplyr)
    library(dplyr, warn.conflicts = FALSE)

    mtcars2 <- mtcars  # stand-in copy; the data set actually used isn't shown in this excerpt

    new_table <- mtcars2 %>%
      lazy_dt() %>%
      filter(wt < 5) %>%
      mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
      group_by(cyl) %>%
      summarise(l100k = mean(l100k))

    new_table

    #> Source: local data table [?? x 2]
    #> Call:   `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)),
    #>     keyby = .(cyl)]
    #>
    #>     cyl l100k
    #>   <dbl> <dbl>
    #> 1     4  9.05
    #> 2     6 12.0
    #> 3     8 14.9
    #>
    #> # Use as.data.table()/as.data.frame()/as_tibble() to access results
    

    The new_table object is not evaluated until as.data.table(), as.data.frame(), or as_tibble() is called on it, at which point the underlying data.table operation is executed.
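    A minimal sketch of that deferred evaluation (assuming dtplyr and dplyr are installed; the variable names here are illustrative):

    ```r
    library(dtplyr)
    library(dplyr, warn.conflicts = FALSE)

    # Build a lazy pipeline: nothing is computed yet
    lazy_step <- mtcars %>%
      lazy_dt() %>%
      filter(wt < 5) %>%
      group_by(cyl) %>%
      summarise(n = n())

    # Printing lazy_step shows the translated data.table call rather than
    # fully materialized results; as_tibble() forces execution of the
    # underlying data.table code
    result <- as_tibble(lazy_step)
    result
    ```

    Until that final conversion, dtplyr only accumulates the translated data.table expression, so intermediate steps cost essentially nothing.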

    I've recreated a benchmark analysis done by data.table author Matt Dowle back in December 2018, which covers the case of operations over large numbers of groups. I found that dtplyr does, for the most part, enable those who prefer the dplyr syntax to keep using it while enjoying the speed offered by data.table.
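    A small-scale sketch of that kind of many-groups comparison (the sizes below are illustrative, not the ones used in the 2018 benchmark, and timings will vary by machine):

    ```r
    library(data.table)
    library(dtplyr)
    library(dplyr, warn.conflicts = FALSE)

    set.seed(1)
    N <- 1e6; G <- 1e5  # illustrative: 1M rows, up to 100k groups
    df <- data.frame(g = sample.int(G, N, replace = TRUE), x = runif(N))
    dt <- as.data.table(df)

    # Same grouped mean, three ways
    system.time(dt[, .(m = mean(x)), keyby = g])                 # data.table
    system.time(df %>% group_by(g) %>% summarise(m = mean(x)))   # dplyr
    system.time(df %>% lazy_dt() %>% group_by(g) %>%
                  summarise(m = mean(x)) %>% as_tibble())        # dtplyr backend
    ```

    The dtplyr timing typically tracks the plain data.table one closely, since the pipeline is translated into a single data.table call before execution.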
