Unexpected behaviour in data.table non-equi join

栀梦 2021-01-03 11:01

This is a follow-up on this question, where the accepted answer showed an example of a matching exercise using data.table, including non-equi conditions.

1 Answer
  • 2021-01-03 11:52

    This is a workaround. It is not at all elegant, but it appears to give the right result until the bug is fixed.

    First, we need each row in DT1 and DT2 to have a unique id. A row number will do.

    DT1[, DT1_ID := .I]  # .I is the row number; also safe for a zero-row table
    DT2[, DT2_ID := .I]
    

    Then we do the following right join to find the matches:

    M <- DT2[DT1, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]
    
    head(M, 3)
    
       RANDOM_STRING START_DATE EXPIRY_DATE DT2_ID DT1_ID
    1:         diejk 2016-03-30  2016-03-30     NA      1
    2:         afjgf 2016-09-14  2016-09-14     NA      2
    3:         kehgb 2016-12-11  2016-12-11     NA      3
    

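    M has each row from DT1 next to all of its matches in DT2. Where DT2_ID is NA, there was no match. nrow(M) is 100,969, which is more than the 100,000 rows in DT1, so some DT1 rows were matched to more than one DT2 row. (The date columns also took on unexpected values: START_DATE and EXPIRY_DATE in M both appear to show the DATE value from DT1 rather than the dates stored in DT2.)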
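
    To make the shape of M easier to see, here is a minimal sketch on made-up data. The three-row tables below are hypothetical and are not the data from the question; only the column names and the join call are taken from the answer.

    ```r
    library(data.table)

    # Hypothetical toy data (assumption: same column names as in the question)
    DT1 <- data.table(
      RANDOM_STRING = c("a", "b", "c"),
      DATE = as.Date(c("2016-01-10", "2016-02-10", "2016-03-10"))
    )
    DT2 <- data.table(
      RANDOM_STRING = c("a", "a", "b"),
      START_DATE  = as.Date(c("2016-01-01", "2016-01-05", "2016-03-01")),
      EXPIRY_DATE = as.Date(c("2016-01-31", "2016-01-15", "2016-03-31"))
    )
    DT1[, DT1_ID := .I]
    DT2[, DT2_ID := .I]

    # Same right join as above: every DT1 row is kept, with all of its DT2 matches
    M <- DT2[DT1, on = .(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]
    print(M)

    # DT1 rows that matched more than one DT2 row
    print(M[!is.na(DT2_ID), .N, by = DT1_ID][N > 1])
    ```

    In this toy case the "a" row falls inside two DT2 intervals, so it should appear twice in M, while the "b" and "c" rows have no match and should carry DT2_ID = NA. The START_DATE and EXPIRY_DATE columns of M should both hold the DATE from DT1, which is the same effect seen in the head(M, 3) output above.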

    Next, we label each row in the original DT1 according to whether or not it was matched. The %in% test already returns TRUE or FALSE, so no ifelse() is needed.

    DT1[, MATCHED := DT1_ID %in% M[!is.na(DT2_ID), DT1_ID]]
    

    Final result: 13,316 of the 100,000 rows in DT1 have at least one match.

    DT1[, .N, by=MATCHED]
    
       MATCHED     N
    1:   FALSE 86684
    2:    TRUE 13316
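
    Since this is a workaround for a suspected bug, it may be worth cross-checking the labels against a slow but simple row-by-row test on a small sample. The sketch below does that on randomly generated data; the table sizes, the seed, and the 30-day interval length are all assumptions made for illustration, and brute is a hypothetical helper vector, not something from the question.

    ```r
    library(data.table)

    # Hypothetical sample data (assumption: same column names as in the question)
    set.seed(1)
    n <- 200
    DT1 <- data.table(
      RANDOM_STRING = sample(letters[1:5], n, replace = TRUE),
      DATE = as.Date("2016-01-01") + sample(0:364, n, replace = TRUE)
    )
    DT2 <- data.table(
      RANDOM_STRING = sample(letters[1:5], 20, replace = TRUE),
      START_DATE = as.Date("2016-01-01") + sample(0:300, 20, replace = TRUE)
    )
    DT2[, EXPIRY_DATE := START_DATE + 30L]
    DT1[, DT1_ID := .I]
    DT2[, DT2_ID := .I]

    # Workaround: right join, then flag DT1 rows with at least one non-NA DT2_ID
    M <- DT2[DT1, on = .(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]
    DT1[, MATCHED := DT1_ID %in% M[!is.na(DT2_ID), DT1_ID]]

    # Brute force: compare each DT1 row against every DT2 row, without any join
    brute <- vapply(seq_len(nrow(DT1)), function(k) {
      any(DT2$RANDOM_STRING == DT1$RANDOM_STRING[k] &
          DT2$START_DATE <= DT1$DATE[k] &
          DT2$EXPIRY_DATE >= DT1$DATE[k])
    }, logical(1))

    # TRUE if the join-based labels agree with the row-by-row check
    print(identical(DT1$MATCHED, brute))
    ```

    Agreement on a small sample does not prove the full-size result is right, but a disagreement would be a clear sign that the join is not doing what is expected.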
    