data.table conditional Inequality join

Question

I have two sample datasets:

> aDT
   col1 col2 ExtractDate
1:    1    A  2017-01-01
2:    1    A  2016-01-01
3:    2    B  2015-01-01
4:    2    B  2014-01-01
> bDT
   col1 col2   date_pol Value
1:    1    A 2017-05-20     1
2:    1    A 2016-05-20     2
3:    1    A 2015-05-20     3
4:    2    B 2014-05-20     4

And I need this result:

> cDT
   col1 col2 ExtractDate   date_pol Value
1:    1    A  2017-01-01 2016-05-20     2
2:    1    A  2016-01-01 2015-05-20     3
3:    2    B  2015-01-01 2014-05-20     4
4:    2    B  2014-01-01         NA    NA

Basically, I want to left join aDT to bDT on col1, col2 and ExtractDate >= date_pol, keeping only the first match (i.e. the highest date_pol that is not later than ExtractDate). A Cartesian join is not an option because of memory limits.

Note: code to generate the sample datasets:

library(data.table)
library(lubridate)  # provides ymd()

aDT <- data.table(col1 = c(1,1,2,2), col2 = c("A","A","B","B"), ExtractDate = c("2017-01-01","2016-01-01","2015-01-01","2014-01-01"))
bDT <- data.table(col1 = c(1,1,1,2), col2 = c("A","A","A","B"), date_pol = c("2017-05-20","2016-05-20","2015-05-20","2014-05-20"), Value = c(1,2,3,4))
# Expected result
cDT <- data.table(col1 = c(1,1,2,2), col2 = c("A","A","B","B"), ExtractDate = c("2017-01-01","2016-01-01","2015-01-01","2014-01-01")
                  ,date_pol = c("2016-05-20","2015-05-20","2014-05-20",NA), Value = c(2,3,4,NA))

# Convert the character columns to Date, by reference
aDT[, ExtractDate := ymd(ExtractDate)]
bDT[, date_pol := ymd(date_pol)]
# Sort by reference; a bare aDT[order(-ExtractDate)] only returns a sorted copy
setorder(aDT, -ExtractDate)
setorder(bDT, -date_pol)

I have tried:

aDT[, c("date_pol", "Value") :=
      bDT[aDT, 
          .(date_pol, Value)
          ,on = .(date_pol <= ExtractDate
                ,col1 = col1
                ,col2 = col2)
          ,mult = "first"]]

But the results are not what I expected:

> aDT
   col1 col2 ExtractDate   date_pol Value ##date_pol values not right
1:    1    A  2017-01-01 2017-01-01     2
2:    1    A  2016-01-01 2016-01-01     3
3:    2    B  2015-01-01 2015-01-01     4
4:    2    B  2014-01-01 2014-01-01    NA

Answer 1:

When i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val Y's. Columns of x can now be referred to using the prefix x. and is particularly useful during joining to refer to x's join columns as they are otherwise masked by i's. For example, X[Y, .(x.a-i.a, b), on="a"].

bDT[aDT, .(col1, col2, i.ExtractDate, x.date_pol, Value),
    on = .(date_pol <= ExtractDate, col1 = col1, col2 = col2), 
    mult = "first"]

Output:

   col1 col2 i.ExtractDate x.date_pol Value
1:    1    A    2017-01-01 2016-05-20     2
2:    1    A    2016-01-01 2015-05-20     3
3:    2    B    2015-01-01 2014-05-20     4
4:    2    B    2014-01-01       <NA>    NA
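To see why the attempt in the question returned ExtractDate values under the name date_pol, here is a minimal sketch that runs the same join twice, once with the unprefixed columns and once with the x. prefix. It rebuilds the sample data with base R's as.Date() instead of ymd() (an assumption made only to avoid the lubridate dependency), and the names masked and prefixed are purely illustrative.

```r
library(data.table)

# Sample data from the question; as.Date() stands in for lubridate::ymd()
aDT <- data.table(col1 = c(1, 1, 2, 2), col2 = c("A", "A", "B", "B"),
                  ExtractDate = as.Date(c("2017-01-01", "2016-01-01",
                                          "2015-01-01", "2014-01-01")))
bDT <- data.table(col1 = c(1, 1, 1, 2), col2 = c("A", "A", "A", "B"),
                  date_pol = as.Date(c("2017-05-20", "2016-05-20",
                                       "2015-05-20", "2014-05-20")),
                  Value = c(1, 2, 3, 4))
setorder(bDT, -date_pol)  # newest policy date first, as in the question

# Unprefixed: date_pol is a join column, so the result carries i's values
# (aDT$ExtractDate) under x's column name
masked <- bDT[aDT, .(date_pol, Value),
              on = .(date_pol <= ExtractDate, col1, col2), mult = "first"]

# Prefixed: x.date_pol reads the matched value from bDT itself
prefixed <- bDT[aDT, .(date_pol = x.date_pol, Value = x.Value),
                on = .(date_pol <= ExtractDate, col1, col2), mult = "first"]

print(masked)
print(prefixed)
```

Both calls match the same rows, so Value is identical in the two results; only the date_pol column differs.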

Answer 2:

I like the approach you took yourself: updating aDT by reference without explicitly listing its columns in the left join. This is very helpful when the left side of the join has a lot of columns, because you don't have to specify them all.

The only thing you need to change is to use the prefix x. for the columns taken from bDT:

aDT[, c("date_pol", "Value") := bDT[aDT, on = .(date_pol <= ExtractDate, col1, col2), 
    mult = "first", .(x.date_pol, x.Value)]]

Output:

   col1 col2 ExtractDate   date_pol Value
1:    1    A  2017-01-01 2016-05-20     2
2:    1    A  2016-01-01 2015-05-20     3
3:    2    B  2015-01-01 2014-05-20     4
4:    2    B  2014-01-01       <NA>    NA
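As a possible alternative (not part of the answers above), the same "most recent date_pol on or before ExtractDate" lookup can also be written as a rolling join with roll = TRUE. The sketch below is a swapped-in technique under a few assumptions: it uses as.Date() instead of ymd(), it deliberately sorts bDT in ascending order to show that the rolling join should not depend on row order, and the name rolled is illustrative.

```r
library(data.table)

aDT <- data.table(col1 = c(1, 1, 2, 2), col2 = c("A", "A", "B", "B"),
                  ExtractDate = as.Date(c("2017-01-01", "2016-01-01",
                                          "2015-01-01", "2014-01-01")))
bDT <- data.table(col1 = c(1, 1, 1, 2), col2 = c("A", "A", "A", "B"),
                  date_pol = as.Date(c("2017-05-20", "2016-05-20",
                                       "2015-05-20", "2014-05-20")),
                  Value = c(1, 2, 3, 4))
setorder(bDT, date_pol)  # ascending on purpose: row order is not relied on

# Equi-join on col1 and col2; roll applies to the last join column, so each
# ExtractDate picks up the latest date_pol that is not after it (LOCF).
# x.date_pol is needed because the join column itself carries i's values.
rolled <- bDT[aDT, .(date_pol = x.date_pol, Value = x.Value),
              on = .(col1, col2, date_pol = ExtractDate), roll = TRUE]

# Add the looked-up columns to aDT by reference, as in the answer above
aDT[, c("date_pol", "Value") := rolled]
print(aDT)
```

Like mult = "first", this returns one row per row of aDT, so it also avoids a Cartesian join.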

Source: https://stackoverflow.com/questions/47524918/data-table-conditional-inequality-join

Tags

join

data.table

conditional