data.table conditional Inequality join

谁说我不能喝 submitted on 2020-01-30 06:00:26

Question


There are two sample datasets:

> aDT
   col1 col2 ExtractDate
1:    1    A  2017-01-01
2:    1    A  2016-01-01
3:    2    B  2015-01-01
4:    2    B  2014-01-01
> bDT
   col1 col2   date_pol Value
1:    1    A 2017-05-20     1
2:    1    A 2016-05-20     2
3:    1    A 2015-05-20     3
4:    2    B 2014-05-20     4

And I need:

> cDT
   col1 col2 ExtractDate   date_pol Value
1:    1    A  2017-01-01 2016-05-20     2
2:    1    A  2016-01-01 2015-05-20     3
3:    2    B  2015-01-01 2014-05-20     4
4:    2    B  2014-01-01         NA    NA

Basically, I want to left join aDT to bDT on col1, col2 and ExtractDate >= date_pol, keeping only the first match (i.e. the highest date_pol). A Cartesian join is not an option because of memory limits.

Note: code to generate the sample datasets:

library(data.table)
library(lubridate)  # provides ymd()

aDT <- data.table(col1 = c(1,1,2,2), col2 = c("A","A","B","B"), ExtractDate = c("2017-01-01","2016-01-01","2015-01-01","2014-01-01"))
bDT <- data.table(col1 = c(1,1,1,2), col2 = c("A","A","A","B"), date_pol = c("2017-05-20","2016-05-20","2015-05-20","2014-05-20"), Value = c(1,2,3,4))
cDT <- data.table(col1 = c(1,1,2,2), col2 = c("A","A","B","B"), ExtractDate = c("2017-01-01","2016-01-01","2015-01-01","2014-01-01")
                  ,date_pol = c("2016-05-20","2015-05-20","2014-05-20",NA), Value = c(2,3,4,NA))

# Convert the character dates to Date
aDT[, ExtractDate := ymd(ExtractDate)]
bDT[, date_pol := ymd(date_pol)]
# Sort in place, newest first, so that mult = "first" picks the highest date_pol
# (a bare aDT[order(-ExtractDate)] only prints a sorted copy)
setorder(aDT, -ExtractDate)
setorder(bDT, -date_pol)

I have tried:

aDT[, c("date_pol", "Value") :=
      bDT[aDT, 
          .(date_pol, Value)
          ,on = .(date_pol <= ExtractDate
                ,col1 = col1
                ,col2 = col2)
          ,mult = "first"]]

But the results are not what I expected:

> aDT
   col1 col2 ExtractDate   date_pol Value ##date_pol values not right
1:    1    A  2017-01-01 2017-01-01     2
2:    1    A  2016-01-01 2016-01-01     3
3:    2    B  2015-01-01 2015-01-01     4
4:    2    B  2014-01-01 2014-01-01    NA

Answer 1:


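When i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val to Y's. Columns of x can be referred to using the prefix x., which is particularly useful during joining to refer to x's join columns, as they are otherwise masked by i's. For example, X[Y, .(x.a-i.a, b), on="a"].

In your attempt, the unprefixed date_pol in j is the join column, so it takes its values from aDT's ExtractDate. That is why it showed the extract dates instead of bDT's dates. Use x.date_pol to get the values from bDT: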

bDT[aDT, .(col1, col2, i.ExtractDate, x.date_pol, Value),
    on = .(date_pol <= ExtractDate, col1 = col1, col2 = col2), 
    mult = "first"]

Output:

   col1 col2 i.ExtractDate x.date_pol Value
1:    1    A    2017-01-01 2016-05-20     2
2:    1    A    2016-01-01 2015-05-20     3
3:    2    B    2015-01-01 2014-05-20     4
4:    2    B    2014-01-01       <NA>    NA
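The masking described above can be seen side by side in a minimal sketch. The tables X and Y and the column names d, v and cutoff are made up for illustration; they mirror the rows for col1 = 1, col2 = "A" in the question.

```r
library(data.table)

# Hypothetical toy tables: X plays the role of bDT, Y the role of aDT
X <- data.table(id = c(1, 1, 1),
                d  = as.Date(c("2017-05-20", "2016-05-20", "2015-05-20")),
                v  = c(1, 2, 3))
Y <- data.table(id = 1, cutoff = as.Date("2017-01-01"))

# d     : the join column, filled from Y's cutoff
# x.d   : the value actually stored in X
res <- X[Y, .(d, x.d, i.cutoff, v),
         on = .(d <= cutoff, id),
         mult = "first"]
print(res)
```

Here d comes back as 2017-01-01 (the value from Y), while x.d is 2016-05-20, the first row of X that satisfies d <= cutoff.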



Answer 2:


I like your own approach: updating aDT by reference, without explicitly listing its columns in the left join. This is very helpful when the left side of the join has a lot of columns, because you don't have to specify them all.

The only thing you need to do is use the prefix x.

aDT[, c("date_pol", "Value") := bDT[aDT, on = .(date_pol <= ExtractDate, col1, col2), 
    mult = "first", .(x.date_pol, x.Value)]]

Output:

   col1 col2 ExtractDate   date_pol Value
1:    1    A  2017-01-01 2016-05-20     2
2:    1    A  2016-01-01 2015-05-20     3
3:    2    B  2015-01-01 2014-05-20     4
4:    2    B  2014-01-01       <NA>    NA
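As an aside, the same "latest date_pol on or before ExtractDate" lookup can also be written as a rolling join. This is a different technique from the one used in the answers, shown only as a sketch under the assumption that both date columns are of class Date. It does not rely on bDT being sorted newest first.

```r
library(data.table)

aDT <- data.table(col1 = c(1, 1, 2, 2), col2 = c("A", "A", "B", "B"),
                  ExtractDate = as.Date(c("2017-01-01", "2016-01-01",
                                          "2015-01-01", "2014-01-01")))
bDT <- data.table(col1 = c(1, 1, 1, 2), col2 = c("A", "A", "A", "B"),
                  date_pol = as.Date(c("2017-05-20", "2016-05-20",
                                       "2015-05-20", "2014-05-20")),
                  Value = c(1, 2, 3, 4))

# roll = TRUE applies to the last join column: for each ExtractDate it
# carries forward the most recent date_pol that is not after it.
res <- bDT[aDT,
           .(col1, col2, ExtractDate = i.ExtractDate,
             date_pol = x.date_pol, Value),
           on = .(col1, col2, date_pol == ExtractDate),
           roll = TRUE]
print(res)
```

As in the answers, x.date_pol is needed to get bDT's own date, because the join column itself is filled from ExtractDate.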


Source: https://stackoverflow.com/questions/47524918/data-table-conditional-inequality-join
