data.table vs dplyr: can one do something well the other can't or does poorly?

Asked by 迷失自我, 2020-11-22 08:53

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that hav…

4 Answers
  •  北恋 (OP)
     2020-11-22 09:19

    Reading Hadley and Arun's answers, one gets the impression that those who prefer dplyr's syntax would in some cases have to switch over to data.table or accept long running times.

    But as some have already mentioned, dplyr can use data.table as a backend. This is accomplished using the dtplyr package, which recently had its 1.0.0 release. Learning dtplyr incurs practically zero additional effort.

    When using dtplyr one uses the function lazy_dt() to declare a lazy data.table, after which standard dplyr syntax is used to specify operations on it. This would look something like the following:

    library(dtplyr)
    library(dplyr, warn.conflicts = FALSE)

    mtcars2 <- mtcars  # stand-in copy; the data set actually used isn't shown in this excerpt

    new_table <- mtcars2 %>%
      lazy_dt() %>%
      filter(wt < 5) %>%
      mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
      group_by(cyl) %>%
      summarise(l100k = mean(l100k))

    new_table

    #> Source: local data table [?? x 2]
    #> Call:   `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)),
    #>     keyby = .(cyl)]
    #>
    #>     cyl l100k
    #>   <dbl> <dbl>
    #> 1     4  9.05
    #> 2     6 12.0
    #> 3     8 14.9
    #>
    #> # Use as.data.table()/as.data.frame()/as_tibble() to access results
    

    The new_table object is not evaluated until as.data.table(), as.data.frame(), or as_tibble() is called on it, at which point the underlying data.table operation is executed.
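    A minimal sketch of that deferred evaluation (assuming dtplyr and dplyr are installed; the variable names here are illustrative):

    ```r
    library(dtplyr)
    library(dplyr, warn.conflicts = FALSE)

    # Build a lazy pipeline: nothing is computed yet
    lazy_step <- mtcars %>%
      lazy_dt() %>%
      filter(wt < 5) %>%
      group_by(cyl) %>%
      summarise(n = n())

    # Printing lazy_step shows the translated data.table call rather than
    # fully materialized results; as_tibble() forces execution of the
    # underlying data.table code
    result <- as_tibble(lazy_step)
    result
    ```

    Until that final conversion, dtplyr only accumulates the translated data.table expression, so intermediate steps cost essentially nothing.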

    I've recreated a benchmark analysis done by data.table author Matt Dowle back in December 2018, which covers the case of operations over large numbers of groups. I found that dtplyr does, for the most part, enable those who prefer the dplyr syntax to keep using it while enjoying the speed offered by data.table.
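    A small-scale sketch of that kind of many-groups comparison (the sizes below are illustrative, not the ones used in the 2018 benchmark, and timings will vary by machine):

    ```r
    library(data.table)
    library(dtplyr)
    library(dplyr, warn.conflicts = FALSE)

    set.seed(1)
    N <- 1e6; G <- 1e5  # illustrative: 1M rows, up to 100k groups
    df <- data.frame(g = sample.int(G, N, replace = TRUE), x = runif(N))
    dt <- as.data.table(df)

    # Same grouped mean, three ways
    system.time(dt[, .(m = mean(x)), keyby = g])                 # data.table
    system.time(df %>% group_by(g) %>% summarise(m = mean(x)))   # dplyr
    system.time(df %>% lazy_dt() %>% group_by(g) %>%
                  summarise(m = mean(x)) %>% as_tibble())        # dtplyr backend
    ```

    The dtplyr timing typically tracks the plain data.table one closely, since the pipeline is translated into a single data.table call before execution.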
