I have a large data set that has unique IDs for individuals as well as dates, and each individual is capable of multiple encounters.
The below is code and an example of how this data might look:
```r
strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16",
              "11/8/16", "6/8/16", "5/8/16", "2/3/16", "1/1/16")
Date <- as.Date(strDates, "%m/%d/%y")
ID <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B")
Event <- c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0)
sample_df <- data.frame(Date, ID, Event)
sample_df
```

```
         Date ID Event
1  2016-09-09  A     1
2  2016-06-07  A     0
3  2016-05-06  A     1
4  2016-02-03  A     0
5  2016-02-01  A     1
6  2016-11-08  B     0
7  2016-06-08  B     1
8  2016-05-08  B     1
9  2016-02-03  B     1
10 2016-01-01  B     0
```
I want to keep all attached information per encounter, but also aggregate the following historical information by ID:
- Number of Previous Encounters
- Number of Previous Events
As an example, let's look at Row 2.
Row 2 is ID A, so I would reference Rows 3-5 (which occurred prior to the Row 2 encounter). Within this group of rows, we see that Rows 3 and 5 both had events.
Number of Previous Encounters for Row 2 = 3
Number of Previous Events for Row 2 = 2
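To make the target concrete, the counts for Row 2 can be checked directly with base R subsetting (a minimal sketch; it rebuilds the example data so it runs on its own):

```r
# Rebuild the example data so the snippet is self-contained
strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16",
              "11/8/16", "6/8/16", "5/8/16", "2/3/16", "1/1/16")
sample_df <- data.frame(Date = as.Date(strDates, "%m/%d/%y"),
                        ID = c("A","A","A","A","A","B","B","B","B","B"),
                        Event = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0))

# Rows with the same ID as Row 2 and a strictly earlier Date
prev <- sample_df$ID == sample_df$ID[2] & sample_df$Date < sample_df$Date[2]
sum(prev)                   # 3 previous encounters
sum(sample_df$Event[prev])  # 2 previous events
```

This is the per-row definition the answers below vectorize per group.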
Ideally, I would get the following output:
```
         Date ID Event PrevEnc PrevEvent
1  2016-09-09  A     1       4         2
2  2016-06-07  A     0       3         2
3  2016-05-06  A     1       2         1
4  2016-02-03  A     0       1         1
5  2016-02-01  A     1       0         0
6  2016-11-08  B     0       4         3
7  2016-06-08  B     1       3         2
8  2016-05-08  B     1       2         1
9  2016-02-03  B     1       1         0
10 2016-01-01  B     0       0         0
```
So far, I have tried working this problem in `dplyr` with `mutate` as well as `summarise`, but neither has let me restrict the aggregation to events that occurred previously for a specific ID. I have also tried some messy for-loops with if-then statements, but I am really just wondering if a package or technique exists to simplify this process.
Thank you!
The biggest impediment is the current sort order. Here, I stored an original index, which I later used to re-sort the data (and then removed). Other than that, the basic idea is to count up from 0 for the encounters, and to use `cumsum` to count the events as they happen. To that end, `lag` is used to avoid counting the current event.
```r
library(dplyr)

sample_df %>%
  mutate(origIndex = 1:n()) %>%
  group_by(ID) %>%
  arrange(ID, Date) %>%
  mutate(PrevEncounters = 0:(n() - 1),
         PrevEvents = cumsum(lag(Event, default = 0))) %>%
  arrange(origIndex) %>%
  select(-origIndex)
```
Gives
```
         Date     ID Event PrevEncounters PrevEvents
       <date> <fctr> <dbl>          <int>      <dbl>
1  2016-09-09      A     1              4          2
2  2016-06-07      A     0              3          2
3  2016-05-06      A     1              2          1
4  2016-02-03      A     0              1          1
5  2016-02-01      A     1              0          0
6  2016-11-08      B     0              4          3
7  2016-06-08      B     1              3          2
8  2016-05-08      B     1              2          1
9  2016-02-03      B     1              1          0
10 2016-01-01      B     0              0          0
```
Or, if you want to try `data.table`, you can use this:
```r
library(data.table)

# Convert to data.table and sort by date
sample_dt <- as.data.table(sample_df)
sample_dt <- sample_dt[order(Date)]

# Count only the previous Events with 1
sample_dt[, prevEvent := ifelse(Event == 1, cumsum(Event) - 1, cumsum(Event)),
          by = "ID"]

# .I gives the row number, and .SD contains the Subset of the Data for each group
sample_dt[, prevEnc := .SD[, .I - 1], by = "ID"]

print(sample_dt)
```

```
          Date ID Event prevEvent prevEnc
 1: 2016-01-01  B     0         0       0
 2: 2016-02-01  A     1         0       0
 3: 2016-02-03  A     0         1       1
 4: 2016-02-03  B     1         0       1
 5: 2016-05-06  A     1         1       2
 6: 2016-05-08  B     1         1       2
 7: 2016-06-07  A     0         2       3
 8: 2016-06-08  B     1         2       3
 9: 2016-09-09  A     1         2       4
10: 2016-11-08  B     0         3       4
```
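Note that this result is left sorted by `Date` rather than in the original row order. If you need the original order back, one option is the same trick as the dplyr answer: keep a temporary index column (here called `origIndex`, my own name for illustration) and re-sort on it at the end. A sketch, which also uses `cumsum(Event) - Event` as an equivalent of the `ifelse` above:

```r
library(data.table)

# Self-contained rebuild of the example data as a data.table
strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16",
              "11/8/16", "6/8/16", "5/8/16", "2/3/16", "1/1/16")
sample_dt <- data.table(Date = as.Date(strDates, "%m/%d/%y"),
                        ID = c("A","A","A","A","A","B","B","B","B","B"),
                        Event = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0))

sample_dt[, origIndex := .I]               # remember the original row order
setorder(sample_dt, Date)                  # sort chronologically
sample_dt[, prevEvent := cumsum(Event) - Event, by = ID]  # events strictly before each row
sample_dt[, prevEnc := seq_len(.N) - 1L, by = ID]         # earlier encounters per ID
setorder(sample_dt, origIndex)             # restore the original order
sample_dt[, origIndex := NULL]
```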
If you don't know this package yet, there is an excellent cheat sheet covering most of the operations.
As @Frank and @MarkPeterson point out, the biggest hurdle here is that the `Date` column is sorted in descending order. Here is another approach that does not require re-sorting the `Date` column:
```r
library(dplyr)

res <- sample_df %>%
  group_by(ID) %>%
  mutate(PrevEnc = n() - row_number(),
         PrevEvent = rev(cumsum(lag(rev(Event), default = 0))))
```
Here, we use `row_number()` to determine the row index and `n()` to determine the number of rows (grouped by `ID`). Since `Date` is sorted in descending order, the number of previous encounters is simply `n() - row_number()`. To compute the number of previous events, we again exploit the fact that the `Date` column is sorted in descending order: we use `rev` to reverse the order of the `Event` column, take the `cumsum` of the `lag` of this reversed column, and then use `rev` again to reverse the result back to the original order.
Using your data:
```
print(res)
##Source: local data frame [10 x 5]
##Groups: ID [2]
##
##         Date     ID Event PrevEnc PrevEvent
##       <date> <fctr> <dbl>   <int>     <dbl>
##1  2016-09-09      A     1       4         2
##2  2016-06-07      A     0       3         2
##3  2016-05-06      A     1       2         1
##4  2016-02-03      A     0       1         1
##5  2016-02-01      A     1       0         0
##6  2016-11-08      B     0       4         3
##7  2016-06-08      B     1       3         2
##8  2016-05-08      B     1       2         1
##9  2016-02-03      B     1       1         0
##10 2016-01-01      B     0       0         0
```