Question
There are two sample datasets:
> aDT
col1 col2 ExtractDate
1: 1 A 2017-01-01
2: 1 A 2016-01-01
3: 2 B 2015-01-01
4: 2 B 2014-01-01
> bDT
col1 col2 date_pol Value
1: 1 A 2017-05-20 1
2: 1 A 2016-05-20 2
3: 1 A 2015-05-20 3
4: 2 B 2014-05-20 4
And I need:
> cDT
col1 col2 ExtractDate date_pol Value
1: 1 A 2017-01-01 2016-05-20 2
2: 1 A 2016-01-01 2015-05-20 3
3: 2 B 2015-01-01 2014-05-20 4
4: 2 B 2014-01-01 NA NA
Basically, I need to left join bDT onto aDT on col1, col2 and ExtractDate >= date_pol, keeping only the first match (i.e. the highest date_pol) for each row of aDT. A Cartesian join is not an option because of memory limits.
Note: To generate sample datasets
library(data.table)
library(lubridate)  # provides ymd()

aDT <- data.table(col1 = c(1,1,2,2), col2 = c("A","A","B","B"), ExtractDate = c("2017-01-01","2016-01-01","2015-01-01","2014-01-01"))
bDT <- data.table(col1 = c(1,1,1,2), col2 = c("A","A","A","B"), date_pol = c("2017-05-20","2016-05-20","2015-05-20","2014-05-20"), Value = c(1,2,3,4))
cDT <- data.table(col1 = c(1,1,2,2), col2 = c("A","A","B","B"), ExtractDate = c("2017-01-01","2016-01-01","2015-01-01","2014-01-01")
,date_pol = c("2016-05-20","2015-05-20","2014-05-20",NA), Value = c(2,3,4,NA))
# Convert the character date columns to Date, by reference
aDT[,ExtractDate := ymd(ExtractDate)]
bDT[,date_pol := ymd(date_pol)]
# Print each table sorted newest first (this returns a sorted copy; the tables themselves are not reordered)
aDT[order(-ExtractDate)]
bDT[order(-date_pol)]
I have tried:
aDT[, c("date_pol", "Value") :=
bDT[aDT,
.(date_pol, Value)
,on = .(date_pol <= ExtractDate
,col1 = col1
,col2 = col2)
,mult = "first"]]
But results are a bit weird:
> aDT
col1 col2 ExtractDate date_pol Value ##date_pol values not right
1: 1 A 2017-01-01 2017-01-01 2
2: 1 A 2016-01-01 2016-01-01 3
3: 2 B 2015-01-01 2015-01-01 4
4: 2 B 2014-01-01 2014-01-01 NA
Answer 1:
When i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val to Y's. Columns of x can now be referred to using the prefix x., which is particularly useful during joining to refer to x's join columns, as they are otherwise masked by i's. For example, X[Y, .(x.a-i.a, b), on="a"].
That masking is what happened in your attempt: in the join result, date_pol holds the values of ExtractDate, so you need x.date_pol to get bDT's own dates:
bDT[aDT, .(col1, col2, i.ExtractDate, x.date_pol, Value),
on = .(date_pol <= ExtractDate, col1 = col1, col2 = col2),
mult = "first"]
Output:
col1 col2 i.ExtractDate x.date_pol Value
1: 1 A 2017-01-01 2016-05-20 2
2: 1 A 2016-01-01 2015-05-20 3
3: 2 B 2015-01-01 2014-05-20 4
4: 2 B 2014-01-01 <NA> NA
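To see the masking in isolation, here is a minimal sketch on toy data. The tables X and Y and their columns (id, d, v, cutoff) are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Hypothetical toy tables, for illustration only
X <- data.table(id = 1, d = as.Date("2016-05-20"), v = 2)
Y <- data.table(id = 1, cutoff = as.Date("2017-01-01"))

# Without prefixes, the join column d in the result carries Y's cutoff values
masked <- X[Y, on = .(id, d <= cutoff)]
masked$d

# With x. and i. prefixes, both dates are available under explicit names
explicit <- X[Y, .(id, cutoff = i.cutoff, d = x.d, v), on = .(id, d <= cutoff)]
explicit$d
```

In the first result d equals the cutoff from Y; in the second, x.d recovers the date that is actually stored in X.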
Answer 2:
I like your own approach of adding the columns to aDT by reference, without explicitly listing aDT's columns in the left join. This is very helpful when the left side of the join has a lot of columns, because you don't have to specify them all.
The only thing you need to change is to use the prefix x. for the columns taken from bDT:
aDT[, c("date_pol", "Value") := bDT[aDT, on = .(date_pol <= ExtractDate, col1, col2),
mult = "first", .(x.date_pol, x.Value)]]
Output:
col1 col2 ExtractDate date_pol Value
1: 1 A 2017-01-01 2016-05-20 2
2: 1 A 2016-01-01 2015-05-20 3
3: 2 B 2015-01-01 2014-05-20 4
4: 2 B 2014-01-01 <NA> NA
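As a possible alternative, not taken from either answer, the same "latest date_pol on or before ExtractDate" lookup can be sketched as a rolling join with roll = TRUE. This assumes the date columns are of class Date (as.Date() is used here instead of ymd() to keep the example free of extra packages), and the result name res is only illustrative.

```r
library(data.table)

aDT <- data.table(col1 = c(1, 1, 2, 2), col2 = c("A", "A", "B", "B"),
                  ExtractDate = as.Date(c("2017-01-01", "2016-01-01",
                                          "2015-01-01", "2014-01-01")))
bDT <- data.table(col1 = c(1, 1, 1, 2), col2 = c("A", "A", "A", "B"),
                  date_pol = as.Date(c("2017-05-20", "2016-05-20",
                                       "2015-05-20", "2014-05-20")),
                  Value = c(1, 2, 3, 4))

# Rolling join: col1 and col2 must match exactly; on the last join column,
# the most recent date_pol that is not after ExtractDate is rolled forward.
# x.date_pol is needed because the join column itself shows ExtractDate.
res <- bDT[aDT,
           .(col1, col2, ExtractDate = i.ExtractDate,
             date_pol = x.date_pol, Value = x.Value),
           on = .(col1, col2, date_pol = ExtractDate),
           roll = TRUE]
res
```

On the sample data this should give the same rows as cDT, with NA in the last row where no date_pol precedes ExtractDate. The rolled column has to be the last one listed in on.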
Source: https://stackoverflow.com/questions/47524918/data-table-conditional-inequality-join