R sum by group if date within date range

Submitted by 浪子不回头ぞ on 2021-01-28 14:26:17

Question


Suppose I have two dataframes.

The first one includes "Date" at which a "Name" issues a "Rec" for an "ID" and the "Stop.Date" at which "Rec" becomes invalid.

df (only a part)

structure(list(Date = structure(c(13236, 13363, 14074, 13199, 
14554), class = "Date"), ID = c("AU0000XINAA9", "AU0000XINAA9", 
"AU0000XINAC5", "AU0000XINAI2", "AU0000XINAJ0"), Name = c("N+1 BREWIN", 
"N+1 BREWIN", "ARBUTHNOT SECURITIES LTD.", "INVESTEC BANK (UK) PLC", 
"AWRAQ INVESTMENTS"), Rec = c(1, 2, 2, 2, 1), Stop.Date = structure(c(13363, 
13509, 14937, 13230, 16702), class = "Date")), .Names = c("Date", 
"ID", "Name", "Rec", "Stop.Date"), class = c("data.table", "data.frame"
), row.names = c(NA, -5L))

The second dataframe only contains a time series, in this case from 2006-02-20 until the end of 2006.

df2

      Date1
  1: 2006-02-20
  2: 2006-02-21
  3: 2006-02-22
  4: 2006-02-23
  5: 2006-02-24
 ---           
311: 2006-12-27
312: 2006-12-28
313: 2006-12-29
314: 2006-12-30
315: 2006-12-31

Now I want my code to sum all "Rec" grouped by ID and Name if the "Date1" variable in df2 falls within the time range (Date until Stop.Date).

I found the post "R - If date falls within range, then sum", which seems very close to my problem, but its solution does not consider any groups.

I want to end up with a data.frame that shows, for each date in df2, the sum of "Rec" for each single "ID".

Expected output e.g.

        Date1         ID          SumRec 

    1 2006-02-20 AU0000XINAI2        2
    2 2006-02-21 AU0000XINAI2        2
...
    4 2006-03-29 AU0000XINAA9        1
    5 2006-03-30 AU0000XINAA9        1
    6 2006-08-03 AU0000XINAA9        2  # since Date1 2006-08-03 is at the end 
                                          of range in df (row#1)-> it falls 
                                          within range in df (row#2) 
...

Please keep in mind this is only a small part of the data. Usually there exist many more Recs for each "ID" from different "Names" (which is why the sum function makes sense).

Many thanks for your help in advance.

UPDATED VERSION

new dataframes:

df

structure(list(Date = structure(c(9905, 10381, 10381, 10954, 
10584, 10632, 10778, 10520, 10631, 10905), class = "Date"), ID = c("BMG4593F1389", 
"BMG4593F1389", "BMG4593F1389", "BMG4593F1389", "BMG4593F1389", 
"BMG4593F1389", "BMG4593F1389", "BMG526551004", "BMG526551004", 
"BMG526551004"), Name = c("ING FM", "Permission Denied 128064", 
"Permission Denied 2880", "Permission Denied 2880", "Permission Denied 32", 
"Permission Denied 888", "Permission Denied 888", "Permission Denied 2880", 
"Permission Denied 2880", "Permission Denied 2880"), Rec = c(2, 
3, 2, 2, 3, 3, 3, 1, 3, 3), Stop.Date = structure(c(12095, 11232, 
10954, 11180, 11345, 10764, 11667, 10631, 10905, 11087), class = "Date")), .Names = c("Date", 
"ID", "Name", "Rec", "Stop.Date"), class = c("data.table", "data.frame"
), row.names = c(NA, -10L))

df2

structure(list(Date1 = structure(c(10954, 10955, 10956, 10957, 
10958, 10959), class = "Date")), .Names = "Date1", row.names = c(NA, 
-6L), class = c("data.table", "data.frame"))

If I now execute the following code:

library(data.table)
library(lubridate)  # provides interval() and %within%

df <- df[, interval := interval(df$Date, df$Stop.Date)]

df1 <- do.call(rbind, lapply(df2$Date1, function(x) {
  index <- x %within% df$interval
  list(ID = ifelse(any(index), df$ID[index], NA),
       Rec = ifelse(any(index), df$Rec[index], NA),
       Name = ifelse(any(index), df$Name[index], NA),
       interval = ifelse(any(index), df$interval[index], NA))
}))

df3 <- cbind(df2, df1)

I come up with the following result:

     Date1        ID        Rec  Name interval
1: 1999-12-29 BMG4593F1389   2 ING FM 189216000
2: 1999-12-30 BMG4593F1389   2 ING FM 189216000
3: 1999-12-31 BMG4593F1389   2 ING FM 189216000
4: 2000-01-01 BMG4593F1389   2 ING FM 189216000
5: 2000-01-02 BMG4593F1389   2 ING FM 189216000
6: 2000-01-03 BMG4593F1389   2 ING FM 189216000

But since, e.g., df2$Date1 "1999-12-29" for df$ID "BMG4593F1389" falls within the date range of 6 more entries in df (for different df$Name values), the result for this particular df2$Date1 should be:

Expected result for Date 1999-12-29 (df3$interval variable neglected here for simplicity)

         Date1        ID        Rec         Name 
    1: 1999-12-29 BMG4593F1389   2   ING FM 
    2: 1999-12-29 BMG4593F1389   3   Permission Denied 128064 
    3: 1999-12-29 BMG4593F1389   2   Permission Denied 2880
    4: 1999-12-29 BMG4593F1389   3   Permission Denied 32
    5: 1999-12-29 BMG4593F1389   3   Permission Denied 888

    6: 1999-12-29 BMG526551004   3   Permission Denied 2880

    7: 1999-12-30 BMG4593F1389   2   ING FM
... etc

So in the end I need the dates in df2$Date1 replicated whenever more than one Name issues a Rec for a specific df$ID whose date range contains that date.

Can somebody help me with that?


Answer 1:


If I understand the updated version of the question correctly, this can be solved using a non-equi join and subsequent aggregation:

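library(data.table)
# non-equi join: match each Date1 to the df rows with Date <= Date1 < Stop.Date
df[df2, on = .(Date <= Date1, Stop.Date > Date1), allow.cartesian = TRUE][
  # aggregation; after the join, the Date column holds the Date1 values from df2
  , .(sumRec = sum(Rec)), by = .(Date, ID, Name)]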
          Date           ID                     Name sumRec
 1: 1999-12-29 BMG4593F1389                   ING FM      2
 2: 1999-12-29 BMG4593F1389 Permission Denied 128064      3
 3: 1999-12-29 BMG4593F1389   Permission Denied 2880      2
 4: 1999-12-29 BMG4593F1389     Permission Denied 32      3
 5: 1999-12-29 BMG4593F1389    Permission Denied 888      3
 6: 1999-12-29 BMG526551004   Permission Denied 2880      3
 7: 1999-12-30 BMG4593F1389                   ING FM      2
 8: 1999-12-30 BMG4593F1389 Permission Denied 128064      3
 9: 1999-12-30 BMG4593F1389   Permission Denied 2880      2
10: 1999-12-30 BMG4593F1389     Permission Denied 32      3
11: 1999-12-30 BMG4593F1389    Permission Denied 888      3
12: 1999-12-30 BMG526551004   Permission Denied 2880      3
13: 1999-12-31 BMG4593F1389                   ING FM      2
14: 1999-12-31 BMG4593F1389 Permission Denied 128064      3
15: 1999-12-31 BMG4593F1389   Permission Denied 2880      2
16: 1999-12-31 BMG4593F1389     Permission Denied 32      3
17: 1999-12-31 BMG4593F1389    Permission Denied 888      3
18: 1999-12-31 BMG526551004   Permission Denied 2880      3
19: 2000-01-01 BMG4593F1389                   ING FM      2
20: 2000-01-01 BMG4593F1389 Permission Denied 128064      3
21: 2000-01-01 BMG4593F1389   Permission Denied 2880      2
22: 2000-01-01 BMG4593F1389     Permission Denied 32      3
23: 2000-01-01 BMG4593F1389    Permission Denied 888      3
24: 2000-01-01 BMG526551004   Permission Denied 2880      3
25: 2000-01-02 BMG4593F1389                   ING FM      2
26: 2000-01-02 BMG4593F1389 Permission Denied 128064      3
27: 2000-01-02 BMG4593F1389   Permission Denied 2880      2
28: 2000-01-02 BMG4593F1389     Permission Denied 32      3
29: 2000-01-02 BMG4593F1389    Permission Denied 888      3
30: 2000-01-02 BMG526551004   Permission Denied 2880      3
31: 2000-01-03 BMG4593F1389                   ING FM      2
32: 2000-01-03 BMG4593F1389 Permission Denied 128064      3
33: 2000-01-03 BMG4593F1389   Permission Denied 2880      2
34: 2000-01-03 BMG4593F1389     Permission Denied 32      3
35: 2000-01-03 BMG4593F1389    Permission Denied 888      3
36: 2000-01-03 BMG526551004   Permission Denied 2880      3
          Date           ID                     Name sumRec

Please note that I experienced a strange error message when using df as provided in structure(...) directly. The error message went away after calling

df <- as.data.table(df)
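
The first version of the question asked for one total per date and ID, summed across all names. A minimal sketch of that variant, assuming the same non-equi join and simply dropping Name from the grouping. The data.table() construction of df and df2, the nomatch = NULL argument (which drops dates without any match), and the renaming of the joined Date column to Date1 are choices made for this illustration only:

```r
library(data.table)

# Sample data from the updated question, rebuilt with data.table() (illustrative)
df <- data.table(
  Date = as.Date(c(9905, 10381, 10381, 10954, 10584, 10632, 10778,
                   10520, 10631, 10905), origin = "1970-01-01"),
  ID = c(rep("BMG4593F1389", 7), rep("BMG526551004", 3)),
  Name = c("ING FM", "Permission Denied 128064", "Permission Denied 2880",
           "Permission Denied 2880", "Permission Denied 32",
           "Permission Denied 888", "Permission Denied 888",
           "Permission Denied 2880", "Permission Denied 2880",
           "Permission Denied 2880"),
  Rec = c(2, 3, 2, 2, 3, 3, 3, 1, 3, 3),
  Stop.Date = as.Date(c(12095, 11232, 10954, 11180, 11345, 10764, 11667,
                        10631, 10905, 11087), origin = "1970-01-01")
)
df2 <- data.table(Date1 = as.Date(10954:10959, origin = "1970-01-01"))

# Non-equi join as in the answer; nomatch = NULL drops dates without a match
joined <- df[df2, on = .(Date <= Date1, Stop.Date > Date1),
             allow.cartesian = TRUE, nomatch = NULL]

# Group by date and ID only, so Rec is summed across all names
res <- joined[, .(SumRec = sum(Rec)), by = .(Date1 = Date, ID)]
print(res)
```

With this sample, every date in df2 yields a SumRec of 13 for BMG4593F1389 and 3 for BMG526551004, because the same six recommendations are valid on all six dates.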

Explanation

I was asked to explain how the non-equi join works. Non-equi joins are an extension of the data.table joins. data.table is a package which enhances base R's data.frame.

Here, we right join df2 with df, i.e., we want to see all rows of df2 with matches in df in the result but only those where Date1 (from df2) lies between Date and Stop.Date (from df), Date <= Date1 < Stop.Date to be exact. As there are many possible matches, we need to use allow.cartesian = TRUE.
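
To make the half-open range and the right-join behaviour concrete, here is a small sketch on two rows taken from the first sample in the question. The table names recs and probe and the three probe dates are made up for this illustration:

```r
library(data.table)

# Two consecutive recommendations for one ID (dates from the first sample)
recs <- data.table(
  Date      = as.Date(c("2006-03-29", "2006-08-03")),
  Stop.Date = as.Date(c("2006-08-03", "2006-12-27")),
  ID        = "AU0000XINAA9",
  Rec       = c(1, 2)
)

# Hypothetical dates to look up: before any range, inside the first range,
# and exactly on the boundary between the two ranges
probe <- data.table(
  Date1 = as.Date(c("2006-01-01", "2006-08-02", "2006-08-03"))
)

# Right join: one result row per match, plus an NA row for unmatched dates
out <- recs[probe, on = .(Date <= Date1, Stop.Date > Date1),
            allow.cartesian = TRUE]
print(out)
```

Here 2006-08-03 matches only the second row, because the first row's Stop.Date is not strictly greater than that date, and 2006-01-01 has no match, so its ID and Rec are NA. Passing nomatch = NULL would drop such unmatched dates instead.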

There is a video of Arun's talk at the useR! 2016 international R User conference introducing Efficient in-memory non-equi joins using data.table.
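
As a side note, one likely reason the attempt in the question returned only a single row per date is that ifelse() returns a result of the same length as its test. Since any(index) has length one, only the first matching element survives. A minimal sketch with made-up vectors (index and ids are hypothetical names):

```r
# Hypothetical match vector and IDs, standing in for the columns of df
index <- c(TRUE, TRUE, FALSE)
ids   <- c("A", "B", "C")

# The test any(index) has length 1, so ifelse() keeps only the first element
first_only <- ifelse(any(index), ids[index], NA)

# Plain subsetting keeps every match
all_matches <- ids[index]

print(first_only)
print(all_matches)
```

The join-based approach avoids this problem because it returns one row per matching pair of date and recommendation.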



Source: https://stackoverflow.com/questions/49662243/r-sum-by-group-if-date-within-date-range
