This is a follow-up to this question, where the accepted answer showed an example of a matching exercise using data.table, including non-equi conditions.
This is a workaround. It is not at all elegant, but it appears to give the right result until the bug is fixed.
First, we need each row in DT1 and DT2 to have a unique id. A row number will do.
DT1[, DT1_ID := .I]
DT2[, DT2_ID := .I]
Then we do the following right join to find the matches:
M <- DT2[DT1, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]
head(M, 3)
RANDOM_STRING START_DATE EXPIRY_DATE DT2_ID DT1_ID
1: diejk 2016-03-30 2016-03-30 NA 1
2: afjgf 2016-09-14 2016-09-14 NA 2
3: kehgb 2016-12-11 2016-12-11 NA 3
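M has each row from DT1 next to all of its matches in DT2. When DT2_ID is NA, there was no match. nrow(M) = 100969, which is more than the 100,000 rows in DT1, indicating that some DT1 rows were matched to more than one DT2 row. (The date columns also took on the wrong values: START_DATE and EXPIRY_DATE both show the DATE from DT1 rather than the original DT2 values.)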
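To make the shape of M easier to see, here is a minimal sketch on toy data. The three-row tables below are hypothetical stand-ins for DT1 and DT2, not the data from the question. The second join uses the `x.` and `i.` prefixes in `j`, which is one way to keep the original DT2 dates alongside the DT1 date, assuming a data.table version that supports those prefixes.

```r
library(data.table)

# Toy stand-ins for DT1 and DT2 (hypothetical values)
DT1 <- data.table(
  RANDOM_STRING = c("a", "a", "b"),
  DATE          = as.Date(c("2016-01-10", "2016-06-01", "2016-01-10"))
)
DT2 <- data.table(
  RANDOM_STRING = c("a", "a", "b"),
  START_DATE    = as.Date(c("2016-01-01", "2016-01-05", "2016-02-01")),
  EXPIRY_DATE   = as.Date(c("2016-01-31", "2016-01-15", "2016-02-28"))
)
DT1[, DT1_ID := .I]
DT2[, DT2_ID := .I]

# Same join as above: DT1 row 1 falls inside two DT2 ranges,
# DT1 rows 2 and 3 fall inside none and get DT2_ID = NA
M <- DT2[DT1, on = .(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]
print(M)

# Variant that keeps the original DT2 dates: x.* refers to DT2's columns,
# i.* to DT1's columns
M2 <- DT2[DT1,
          on = .(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE),
          .(DT1_ID, DT2_ID, DATE = i.DATE,
            START_DATE = x.START_DATE, EXPIRY_DATE = x.EXPIRY_DATE)]
print(M2)
```

In M, START_DATE and EXPIRY_DATE are identical on every row because both carry DT1's DATE. In M2, the unmatched rows have NA in the DT2 columns.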
Next, we can label rows in the original DT1 according to whether or not they were matched. Since %in% already returns a logical vector, no ifelse() wrapper is needed.
DT1[, MATCHED := DT1_ID %in% M[!is.na(DT2_ID), DT1_ID]]
Final result: 13,316 of the 100,000 rows in DT1 have at least one match.
DT1[, .N, by=MATCHED]
MATCHED N
1: FALSE 86684
2: TRUE 13316
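As a sanity check on the labelling, the number of matches per DT1 row can be counted directly from M. The sketch below again uses small hypothetical tables rather than the question's data, and the column name N_MATCHES is made up for illustration. It relies on the fact that M contains one row per match plus one row per unmatched DT1 row.

```r
library(data.table)

# Toy stand-ins for DT1 and DT2 (hypothetical values)
DT1 <- data.table(
  RANDOM_STRING = c("a", "a", "b"),
  DATE          = as.Date(c("2016-01-10", "2016-06-01", "2016-01-10"))
)
DT2 <- data.table(
  RANDOM_STRING = c("a", "a", "b"),
  START_DATE    = as.Date(c("2016-01-01", "2016-01-05", "2016-02-01")),
  EXPIRY_DATE   = as.Date(c("2016-01-31", "2016-01-15", "2016-02-28"))
)
DT1[, DT1_ID := .I]
DT2[, DT2_ID := .I]

M <- DT2[DT1, on = .(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]

# Count real matches per DT1 row; NA in DT2_ID means "no match"
counts <- M[, .(N_MATCHES = sum(!is.na(DT2_ID))), by = DT1_ID]

# Bring the counts back onto DT1 with an update join, then derive the flag
DT1[counts, on = "DT1_ID", N_MATCHES := i.N_MATCHES]
DT1[, MATCHED := N_MATCHES > 0]

# nrow(M) should equal total matches plus the number of unmatched DT1 rows
stopifnot(nrow(M) == sum(DT1$N_MATCHES) + sum(!DT1$MATCHED))
print(DT1[, .N, by = MATCHED])
```

The same identity explains why nrow(M) can exceed nrow(DT1): every extra match for an already-matched row adds one row to M.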