Unexpected behaviour in data.table non-equi join


This is a follow-up on this question, where the accepted answer showed an example of a matching exercise using data.table, including non-equi conditions.

1 answer

一个人的身影

2021-01-03 11:52
This is a workaround. It is not at all elegant, but it appears to give the right result until the bug is fixed.

First, we need each row in DT1 and DT2 to have a unique id. A row number will do.
```
# .I is the row number; unlike 1:nrow(), it is also safe for empty tables
DT1[, DT1_ID := .I]
DT2[, DT2_ID := .I]
```
Then we do the following right join to find the matches:
```
M <- DT2[DT1, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]

head(M, 3)

   RANDOM_STRING START_DATE EXPIRY_DATE DT2_ID DT1_ID
1:         diejk 2016-03-30  2016-03-30     NA      1
2:         afjgf 2016-09-14  2016-09-14     NA      2
3:         kehgb 2016-12-11  2016-12-11     NA      3
```
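M holds each row of DT1 alongside every matching row of DT2. Where DT2_ID is NA, there was no match. nrow(M) is 100969, more than the 100,000 rows in DT1, so some DT1 rows were matched to more than one DT2 row. (The date columns are also misleading: START_DATE and EXPIRY_DATE appear to show the DATE value from DT1 rather than the original DT2 values.)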
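To make the shape of M easier to see, here is a minimal sketch on hypothetical toy data (three DT1 rows and three DT2 rows, not the tables from the question). It also shows one way to recover the original DT2 dates with the `x.` prefix in `j`; the name `M2` and the toy values are my own assumptions.

```r
library(data.table)

# Hypothetical toy tables, only for illustration
DT1 <- data.table(RANDOM_STRING = c("a", "b", "c"),
                  DATE = as.Date(rep("2016-01-10", 3)))
DT2 <- data.table(RANDOM_STRING = c("a", "a", "b"),
                  START_DATE  = as.Date(c("2016-01-01", "2016-01-05", "2016-02-01")),
                  EXPIRY_DATE = as.Date(c("2016-01-31", "2016-01-15", "2016-02-28")))
DT1[, DT1_ID := .I]
DT2[, DT2_ID := .I]

# Same right join as above: every DT1 row is kept, repeated once per match
M <- DT2[DT1, on = .(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]
# "a" falls inside both of its DT2 ranges, so DT1_ID 1 appears twice;
# "b" and "c" have no match, so their DT2_ID is NA
print(M)

# Ask for the x. (DT2) and i. (DT1) columns explicitly to keep the real dates
M2 <- DT2[DT1,
          on = .(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE),
          .(DT1_ID, DT2_ID, DATE = i.DATE,
            START_DATE = x.START_DATE, EXPIRY_DATE = x.EXPIRY_DATE)]
print(M2)
```

In M, both date columns carry DT1's DATE, which matches the output shown above; in M2, they carry the DT2 values (NA for unmatched rows).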

Next, we can label rows in the original DT1 according to whether or not they were matched. Since %in% already returns a logical vector, no ifelse() is needed.
```
DT1[, MATCHED := DT1_ID %in% M[!is.na(DT2_ID), DT1_ID]]
```
Final result: 13,316 of the 100,000 rows in DT1 have at least one match.
```
DT1[, .N, by=MATCHED]

   MATCHED     N
1:   FALSE 86684
2:    TRUE 13316
```
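As a sanity check on the workaround, the steps can be wrapped in a small helper and compared against a slow row-by-row test on simulated data. This is only a sketch: the helper name `flag_matched`, the seed, and the simulated tables are assumptions of mine, not part of the original question.

```r
library(data.table)

# Hypothetical helper wrapping the workaround; it works on copies,
# so the tables passed in are not modified by reference
flag_matched <- function(DT1, DT2) {
  d1 <- copy(DT1)[, DT1_ID := .I]
  d2 <- copy(DT2)[, DT2_ID := .I]
  M <- d2[d1, on = .(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]
  d1[, MATCHED := DT1_ID %in% M[!is.na(DT2_ID), DT1_ID]]
  d1[, DT1_ID := NULL][]
}

# Small simulated tables (assumed data, chosen so that some rows match)
set.seed(1)
n <- 200
DT1 <- data.table(RANDOM_STRING = sample(letters[1:5], n, replace = TRUE),
                  DATE = as.Date("2016-01-01") + sample(0:365, n, replace = TRUE))
DT2 <- data.table(RANDOM_STRING = sample(letters[1:5], n, replace = TRUE),
                  START_DATE = as.Date("2016-01-01") + sample(0:365, n, replace = TRUE))
DT2[, EXPIRY_DATE := START_DATE + sample(0:60, n, replace = TRUE)]

res <- flag_matched(DT1, DT2)

# Brute-force reference: test every DT1 row against all DT2 rows
brute <- vapply(seq_len(nrow(DT1)), function(k) {
  any(DT2$RANDOM_STRING == DT1$RANDOM_STRING[k] &
      DT2$START_DATE  <= DT1$DATE[k] &
      DT2$EXPIRY_DATE >= DT1$DATE[k])
}, logical(1))

stopifnot(identical(res$MATCHED, brute))
res[, .N, by = MATCHED]
```

The brute-force loop is quadratic, so it is only practical for small tables, but agreement on a sample like this gives some reassurance before trusting the join on the full data.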