data.table vs dplyr: can one do something well the other can't or does poorly?

迷失自我 2020-11-22 08:53

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that hav

4 answers

旧巷少年郎 (original poster)

2020-11-22 09:19

In direct response to the Question Title...

dplyr definitely does things that data.table cannot.

Your point #3

dplyr abstracts (or will) potential DB interactions

is a direct answer to your own question but isn't elevated to a high enough level. dplyr is truly an extensible front-end to multiple data storage mechanisms, whereas data.table is an extension to a single one.

Look at dplyr as a back-end agnostic interface, with all of the targets using the same grammar, where you can extend the targets and handlers at will. data.table is, from the dplyr perspective, one of those targets.

You will never (I hope) see a day that data.table attempts to translate your queries to create SQL statements that operate with on-disk or networked data stores.
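To make the SQL-translation point concrete, here is a minimal sketch of dplyr verbs being turned into a SQL statement. It assumes the dbplyr package is installed; the table `sales` and its columns are hypothetical, and `lazy_frame()` with `simulate_sqlite()` only simulates a database, so no real connection is involved.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table; lazy_frame() simulates a remote table without a real database.
sales <- lazy_frame(
  region = c("east", "west"),
  amount = c(10, 20),
  con = simulate_sqlite()
)

# Ordinary dplyr verbs; nothing is computed here, a query is only described.
query <- sales %>%
  filter(amount > 5) %>%
  group_by(region) %>%
  summarise(total = sum(amount, na.rm = TRUE))

# Render the SQL that would be sent to the back-end.
sql_text <- as.character(sql_render(query))
cat(sql_text, "\n")
```

The same pipeline could be pointed at a different simulated or real connection, and only the rendered SQL dialect would change.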

dplyr can possibly do things data.table will not or might not do as well.

Based on the design of working in-memory, data.table could have a much more difficult time extending itself into parallel processing of queries than dplyr.

In response to the in-body questions...

Usage

Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?

This may seem like a punt, but the real answer is no. People familiar with tools seem to use either the one most familiar to them or the one that is actually right for the job at hand. That said, sometimes you want a particular readability, sometimes a particular level of performance, and when you need a high enough level of both you may just need another tool alongside what you already have to make clearer abstractions.

Performance

Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another?

Again, no. data.table excels at being efficient in everything it does, whereas dplyr carries the burden of being limited in some respects by the underlying data store and registered handlers.

This means that when you run into a performance issue with data.table, you can be fairly sure it lies in your own query function; if the bottleneck really is in data.table, you have won yourself the joy of filing a report. The same holds when dplyr uses data.table as the back-end: you may see some overhead from dplyr, but odds are it is your query.
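As a rough illustration of dplyr driving data.table as a back-end, the sketch below assumes the dtplyr package, which is the current way to do this and is not named in the answer itself. The table `dt` and its columns are made up for the example.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Hypothetical data for illustration only.
dt <- data.table(
  region = c("east", "west", "east"),
  amount = c(10, 20, 30)
)

# dplyr grammar, translated by dtplyr into a data.table call and run on as.data.table().
via_dplyr <- lazy_dt(dt) %>%
  group_by(region) %>%
  summarise(total = sum(amount)) %>%
  as.data.table()

# The same aggregation written directly in data.table syntax.
direct <- dt[, .(total = sum(amount)), keyby = region]

print(via_dplyr)
print(direct)
```

Calling `show_query()` on the lazy pipeline before collecting it prints the generated data.table expression, which helps when deciding whether a slowdown comes from the translation layer or from the query itself.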

When dplyr has performance issues with back-ends you can get around them by registering a function for hybrid evaluation or (in the case of databases) manipulating the generated query prior to execution.

Also see the accepted answer to when is plyr better than data.table?
