Question
A follow-up from this question.
I have three data tables (the actual input one is way bigger and performance matters, so I have to use data.table as much as I can):
input <- fread(" ID | T1 | T2 | T3 | DATE
ACC001 | 1 | 0 | 0 | 31/12/2016
ACC001 | 1 | 0 | 1 | 30/06/2017
ACC002 | 0 | 1 | 1 | 31/12/2016", sep = "|")
mevs <- fread(" DATE | INDEX_NAME | INDEX_VALUE
31/12/2016 | GDP | 1.05
30/06/2017 | GDP | 1.06
31/12/2017 | GDP | 1.07
30/06/2018 | GDP | 1.08
31/12/2016 | CPI | 0.02
30/06/2017 | CPI | 0.00
31/12/2017 | CPI | -0.01
30/06/2018 | CPI | 0.01 ", sep = "|")
time <- fread(" DATE
31/12/2017
30/06/2018 ", sep = "|")
With those, I need to achieve two things:
1. Insert GDP and CPI values from the second dt (mevs) into the first one (input), to make some calculations in the last column based on T1, T2, T3, GDP and CPI.
2. Make a projection for the time intervals given in the third dt (time), copying the T1, T2 and T3 values from the previous interval in the same ID (so ACC001 ones would remain 1, 0, 1) if it exists (filling them with 0 if it doesn't) and getting GDP and CPI from the corresponding dates.
For that, I'm using the following pieces of code:
ones <- input[, .N, by = ID][N == 1, ID]
input[, .SD[time, on = "DATE"], by = ID
][dcast(mevs, DATE ~ INDEX_NAME), on = "DATE", `:=` (GDP = i.GDP, CPI = i.CPI)
][, (2:4) := lapply(.SD, function(x) if (.BY %in% ones) replace(x, is.na(x), 0L) else zoo::na.locf(x) )
, by = ID, .SDcols = 2:4][]
Which does (thanks to @Jaap):
input[, .SD[time, on = "DATE"], by = ID] joins for each ID the time data.table to the remaining columns, thus extending the data.table.
A wide version of mevs (dcast(mevs, DATE ~ INDEX_NAME)) is then joined to the extended data.table.
Finally the missing values in the extended data.table are filled with the na.locf function from the zoo package.
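To make the second step concrete, here is a minimal sketch of the long-to-wide reshape on its own. The table mevs_demo is a hypothetical, cut-down copy of mevs (two dates only), and value.var is spelled out explicitly, which the code above leaves to dcast to guess.

```r
library(data.table)

# Hypothetical toy table with the same columns as mevs (values copied from above)
mevs_demo <- data.table(
  DATE        = c("31/12/2016", "30/06/2017", "31/12/2016", "30/06/2017"),
  INDEX_NAME  = c("GDP", "GDP", "CPI", "CPI"),
  INDEX_VALUE = c(1.05, 1.06, 0.02, 0.00)
)

# One row per DATE, one column per distinct INDEX_NAME
wide <- dcast(mevs_demo, DATE ~ INDEX_NAME, value.var = "INDEX_VALUE")
print(wide)
```

The wide table has one GDP and one CPI column per date, which is the shape the update join with `:=` needs.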
The intended output would be:
ID T1 T2 T3 DATE GDP CPI
1: ACC001 1 0 0 31/12/2016 1.05 0.02
2: ACC001 1 0 1 30/06/2017 1.06 0.00
3: ACC001 1 0 1 31/12/2017 1.07 -0.01
4: ACC001 1 0 1 30/06/2018 1.08 0.01
5: ACC002 0 1 1 31/12/2016 1.05 0.02
6: ACC002 0 0 0 30/06/2017 1.06 0.00
7: ACC002 0 0 0 31/12/2017 1.07 -0.01
8: ACC002 0 0 0 30/06/2018 1.08 0.01
But instead what I get is:
ID T1 T2 T3 DATE GDP CPI
1: ACC001 NA NA NA 31/12/2017 1.07 -0.01
2: ACC001 NA NA NA 30/06/2018 1.08 0.01
3: ACC002 NA NA NA 31/12/2017 1.07 -0.01
4: ACC002 NA NA NA 30/06/2018 1.08 0.01
I'm almost sure that it has to be a wrong join choice between input and time in the first step, but I can't find a workaround for this.
Thanks everyone for your time.
Answer 1:
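A possible solution: x[i, on = ...] returns one row for each row of i, so joining each ID on time alone keeps only the two projection dates and drops the rows that were already in input. Join instead on a table that holds all dates (the ones in input plus the ones in time), with DATE converted to Date class so that it sorts chronologically: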
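# all dates to project onto: the ones already in input plus the ones in time
times <- unique(rbindlist(list(time, data.table(DATE = unique(input$DATE))))
)[, DATE := as.Date(DATE, "%d/%m/%Y")][order(DATE)]
# convert DATE in the other tables too, so the join keys have the same class
input[, DATE := as.Date(DATE, "%d/%m/%Y")]
mevs[, DATE := as.Date(DATE, "%d/%m/%Y")]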
ones <- input[, .N, by = ID][N == 1, ID]
input[, .SD[times, on = "DATE"], by = ID
][dcast(mevs, DATE ~ INDEX_NAME), on = "DATE", `:=` (GDP = i.GDP, CPI = i.CPI)
][, (2:4) := lapply(.SD, function(x) if (.BY %in% ones) replace(x, is.na(x), 0L) else zoo::na.locf(x) )
, by = ID, .SDcols = 2:4][]
which gives:
       ID T1 T2 T3       DATE  GDP   CPI
1: ACC001  1  0  0 2016-12-31 1.05  0.02
2: ACC001  1  0  1 2017-06-30 1.06  0.00
3: ACC001  1  0  1 2017-12-31 1.07 -0.01
4: ACC001  1  0  1 2018-06-30 1.08  0.01
5: ACC002  0  1  1 2016-12-31 1.05  0.02
6: ACC002  0  0  0 2017-06-30 1.06  0.00
7: ACC002  0  0  0 2017-12-31 1.07 -0.01
8: ACC002  0  0  0 2018-06-30 1.08  0.01
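To see why the join table matters, here is a small sketch that compares both joins on toy data. The names input_demo, time_demo, all_dates, only_new and extended are made up for illustration, and only one ID and one T column are used.

```r
library(data.table)

# Hypothetical toy versions of input (one ID only) and time, DATE already as Date
input_demo <- data.table(
  ID   = "ACC001",
  T1   = c(1L, 1L),
  DATE = as.Date(c("2016-12-31", "2017-06-30"))
)
time_demo <- data.table(DATE = as.Date(c("2017-12-31", "2018-06-30")))

# x[i, on = ] returns one row per row of i: only the projection dates survive
only_new <- input_demo[, .SD[time_demo, on = "DATE"], by = ID]

# Joining on the union of dates keeps the original rows and adds the new ones
all_dates <- unique(rbind(input_demo[, .(DATE)], time_demo))[order(DATE)]
extended  <- input_demo[, .SD[all_dates, on = "DATE"], by = ID]

print(only_new)   # 2 rows, T1 is NA in both
print(extended)   # 4 rows, T1 is NA only for the two projection dates
```

The NA values left in extended are the ones the last step then fills, either with 0 or by carrying the last observation forward.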
Source: https://stackoverflow.com/questions/51170892/selecting-correct-join-with-data-table