convert dplyr join syntax into pure data.table syntax

后端未结

关注

 2  1861

I am learning data.table. I have difficulty converting the dplyr join syntax. Can you please recommend the data.table equivalence for the following test cases?

相关标签:

2条回答

说谎

2020-12-09 07:34
You should look at ?data.table and go through the examples there. It's a very nice way of learning. We're in the process of writing more detailed vignettes FR #944, scheduled for 1.9.8. But until then:

- data.table form

data.table's syntax is of the form:
```
x[i, j, by, ...] # i = where, j = select|modify|update, by = group by
```
- subset operations

When i is an integer or logical expression, we call it a subset operation. For example:
```
x[a > 1]
```
What does this do? The column a from the data.table x is checked for the condition > 1, which results in a logical vector = length(a). And those rows where the condition evaluates to TRUE are identified, and all the columns corresponding to those rows are returned.

- joins as extension of subsets

concept

In data.table, joins can be seen as a natural extension of subsets. That is, we can think of a join as a subset operation, but using another data.table. This is what we mean by having a consistent syntax -- the form x[i, j, by] is intact.

The first step towards joining in data.tables is setting keys. This can be accomplished using the setkey() function whose purpose is two-fold:
- reorder the rows of the data.table in increasing order (ascending) by the columns provided. This is done by reference to be memory efficient.
- mark those columns provided as key columns on which a joins can be performed (if and when you perform a join).
Note that currently, for a join of the form x[i], x needs to have key column(s) set absolutely. i may or may not have it's key set.
- If i also has it's key column set, then joins are performed by matching the first key column of i with the first key column of x, second with second, etc..
- If i doesn't have key columns set, then the first column of i is matched to first key column of x, second column of i with second key column of x and so on..
Yes we are aware that it'd be nice to match by column names when i doesn't have key columns but we just didn't have the time to get to it yet.
The second and final step is to perform the join :-).

But how is a join operation an extension of subset? When i is a data.table, for each row in i, it finds the matching row indices in x by matching on x's key columns that we've set. This returns a set of row indices of x for each row in i (or NA if no match is found).

Now we have the matching row indices. All we have to return are the columns. But since i is also a data.table, it might have additional columns as well. So, we return the columns of both x and i for those matching row indices.

example

Here's a small example to help you internalise the concept before we move on. Consider the two data.tables X and Y as shown below:
```
X = data.table(a=c(1,1,1,2,2,5,6), b=1:7, key="a")
#    a b
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 4
# 5: 2 5
# 6: 5 6
# 7: 6 7

key(X)
# [1] "a"

Y = data.table(a=c(6,2), c=letters[1:2])
#    a c
# 1: 6 a
# 2: 2 b

key(Y)
# NULL

# join
X[Y]
#    a b c
# 1: 6 7 a
# 2: 2 4 b
# 3: 2 5 b
```
Note that we have used the key= argument in data.table() function to set the key columns directly. Alternatively we could have just created X without keys and then setkey(X, a).

The function key() returns the key columns if any. If no key is set, it returns NULL.

Y doesn't have key columns, and X has only one key column. So join is done using first column a of Y and first key column a of X. a=6 in Y matches with row 7 of X and a=2 on rows 4 and 5.

You can check this by using the argument which = TRUE:
```
X[as.data.table(6), which=TRUE] # [1] 7
X[as.data.table(2), which=TRUE] # [1] 4 5
```
This is also a handy (and fast) way to subset a data.table, but using data.table's fast binary search based subset. Since this operation is quite useful, data.table provides an easy way of doing this instead of having to write as.data.table() each time.
```
# faster way of doing X[a == 6] on data.table with 'a' as key column
X[J(6)] # J for Join
X[J(2)]

# (or)

X[.(6)] # . is an alias for J
X[.(2)]
```
I think this should further help towards understanding what we mean by subsets are extensions of joins.

back to your question

Now, let's forget for a moment about all these "left", "right", "inner", "outer" etc.. and look at the actual operation you want to perform. You've two data.tables - dtP and dtO (shortened for convenience).

case 1:

For each row of column ProductID in dtO, you want to find the matching rows in dtP, but you don't want to return NA. And you want to do also select the columns you want to output, along with some calculations.

That is, i = dtO and x = dtP. Key column for dtP is set correctly. But the key column for dtO is orderID. If we joine as such, it'll be joining orderID from dtO against productID from dtP, which is wrong.

Either we have to set key of dtO to productID or set key of dtO to NULL and move the column productID as the first column (until matching by names is implemented). Let's set key to productID here:
```
# set key
setkey(dtO, ProductID)
# join
dtP[dtO, .(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price), nomatch=0L]
```
It should be quite obvious what this does now. On only matching row indices, extract all these columns (including expression).

Why should we join first and select/aggregate after?

case 2:

Same as case 1, but you need even non matching rows. Key is already set properly from case 1.
```
# join
dtP[dtO, .(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price)]
```
Return all rows of orderID even if there's no match, and all the columns specified (including expression).

case 3:

You want all the rows in dtO that has no match with dtP.
```
not-join or anti-join
dtO[!dtP]
```
Find all rows where dtP's key column matches with dtO. Return all other rows from dtO. If necessary, you can specify all necessary columns in j as well.

case 4:

Read about by=.EACHI from this post.

You are joining by productID and then aggregating by the same column. But why do we need that intermediate result? It's totally unnecessary and waste of memory and computational time! Instead we can make use of by=.EACHI which will evaluate the j-expression on matching rows for each row in i.
```
dtO[dtP, .(.N, sQty = sum(Qty), sSales = sum(Qty*Price)), by=.EACHI, nomatch=0L][order(-sSales)]
```
To test your understanding, try to find out why we did not do dtP[dtO, ...] here..

case 5:

Identical to @eddi's.

I personally find it more natural to think in terms of the actual task I want to perform instead of figuring out the type of join function associated with the task I want to perform (I can never remember which data.table is "left" and which one is "right"... and btw, what the heck is "inner", "outer" and "full outer" join anyways?).

HTH
0 讨论(0)
发布评论:

提交评论
- 加载中...

温柔的废话

2020-12-09 07:40

setkey(dtOrder, ProductID)

(1-2)

# this will be literally what you wrote

dtProduct[dtOrder,
          list(OrderID, ProductID, ProductName, Qty, Price, ExtPrice=Qty*Price),
          nomatch = 0 # or omit this to get (2)
         ]

# but I think you'd be better off with this
dtProduct[dtOrder][, ExtPrice := Qty*Price][]

(3)

# you can again take the literal direction:
dtProduct[dtOrder][!is.na(ProductName)][,
          list(OrderID, ProductID, ProductName, Qty)]

# but again I think you'd be better off with
dtOrder[!dtProduct]

(4-5)

dtProduct[dtOrder, nomatch = 0][,
          list(OrderCount=.N, TotalQty=sum(Qty), TotalSales=sum(Qty*Price)),
          by = list(ProductID, ProductName)][
          order(-TotalSales)]

0 讨论(0)

convert dplyr join syntax into pure data.table syntax

- data.table form

- subset operations

- joins as extension of subsets

concept

example

back to your question

case 1:

case 2:

case 3:

case 4:

case 5: