Calculate days since last event in R

别那么骄傲 2020-12-17 10:04

My question is how to calculate, in R, the number of days since an event last occurred. Below is a minimal example of the data:

df <- data.frame         


        
5 answers
  • 2020-12-17 10:41

    A for loop works, although it is clumsy and slower than a vectorized approach:

    datas <- read.table(text = "date event
    2000-07-06     0
    2000-09-15     0
    2000-10-15     1
    2001-01-03     0
    2001-03-17     1
    2001-05-23     1
    2001-08-26     0", header = TRUE, stringsAsFactors = FALSE)
    
    
    datas <- transform(datas, date = as.Date(date))
    
    lastEvent <- NA
    tae <- rep(NA, length(datas$event))
    for (i in 2:length(datas$event)) {
      if (datas$event[i-1] == 1) {
        lastEvent <- datas$date[i-1]
      }
      tae[i] <- datas$date[i] - lastEvent
    
      # Set the first occurring event to 0 rather than NA
      if (datas$event[i] == 1 && sum(datas$event[seq_len(i - 1)] == 1) == 0) {
        tae[i] <- 0
      }
    }
    
    cbind(datas, tae)
    
            date event tae
    1 2000-07-06     0  NA
    2 2000-09-15     0  NA
    3 2000-10-15     1   0
    4 2001-01-03     0  80
    5 2001-03-17     1 153
    6 2001-05-23     1  67
    7 2001-08-26     0  95
    
  • 2020-12-17 10:46

    I had a similar problem and solved it by combining some of the ideas above. The main difference in my case is that each customer (a through n) has their own events (for me, purchases). I wanted the cumulative totals for these purchases as well as the date of each customer's previous activity. I solved this by building an index data frame and joining it back to the main data frame, similar to the top-rated answer above. See the reproducible code below.

    library(tidyverse)
    rm(list=ls())
    
    # generate a sample data frame; the seed makes the random draws repeatable
    set.seed(42)
    df <- as.data.frame(sample(rep(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 12), each = 4),36))
    df$subtotal <- sample(1:100, 36)
    df$cust <- sample(rep(c("a", "b", "c", "d", "e", "f"), each=12), 36)
    
    colnames(df) <- c("dates", "subtotal", "cust")
    
    #add a "key" based on date and event
    df$datekey <- paste0(df$dates, df$cust)
    
    #The following 2 lines are specific to my own analysis but added to show depth
    df_total_visits <- df %>% select(dates, cust) %>% distinct() %>% group_by(cust) %>% tally(name = "total_visits") %>% mutate(variable = 1)
    df_order_bydate <- df %>% select(dates, cust) %>% group_by(dates, cust) %>% tally(name = "day_orders")
    
    
    df <- left_join(df, df_total_visits)
    df <- left_join(df, df_order_bydate) %>% arrange(dates)
    
    # Now we will add the index, the arrange from the previous line is super important if your data is not already ordered by date
    cummulative_groupping <- df %>% select(datekey, cust, variable, subtotal) %>% group_by(datekey) %>% mutate(spending = sum(subtotal)) %>% distinct(datekey, .keep_all = T) %>% select(-subtotal)
    cummulative_groupping <- cummulative_groupping %>% group_by(cust) %>% mutate(cumulative_visits = cumsum(variable),
                                                                                        cumulative_spend = cumsum(spending))
    
    df <- left_join(df, cummulative_groupping) %>% select(-variable)
    
    #using the cumulative visits as the index, if we add one to this number we can then join it again on our dataframe
    last_date_index <- df %>% select(dates, cust, cumulative_visits)
    last_date_index$cumulative_visits <- last_date_index$cumulative_visits + 1 
    colnames(last_date_index) <- c("last_visit_date", "cust", "cumulative_visits")
    df <- left_join(df, last_date_index, by = c("cust", "cumulative_visits"))
    
    
    # the difference between the date and the last visit answers the original poster's question; a first visit has no previous date, so it stays NA
    df$toa <- df$dates - df$last_visit_date
    

    This answer also works when the same event occurs more than once on the same day (either because of poor data hygiene or because multiple vendors/customers attend that event). Thank you for viewing my answer. This is actually my first post on Stack.
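
    If only the days since each customer's previous visit are needed, a grouped difference may be enough. The following is a minimal base-R sketch, not the join-based method above; the `visits` data frame and its values are made up for illustration, and `days_since_last` is a hypothetical column name.

```r
# Hypothetical visit log: one row per purchase (values are made up).
visits <- data.frame(
  cust  = c("a", "b", "a", "b", "a"),
  dates = as.Date(c("1999-01-05", "1999-01-10", "1999-02-01",
                    "1999-03-01", "1999-02-01"))
)

# Collapse same-day repeats, then sort so each customer's visits are in date order.
visits <- unique(visits)
visits <- visits[order(visits$cust, visits$dates), ]

# Within each customer, the gap in days to the previous visit;
# the first visit has no predecessor and stays NA.
visits$days_since_last <- ave(
  as.numeric(visits$dates), visits$cust,
  FUN = function(d) c(NA, diff(d))
)

visits
```

    Collapsing duplicates first means two purchases on the same day count as one visit, which is an assumption about what "last activity" should mean here.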

  • 2020-12-17 10:51

    I'm way late to the party, but I used tidyr::fill to make this easier. You essentially convert your non-events to missing values, then use fill to replace the NAs with the date of the last event, then subtract the last event date from the current date.

    I've tested this with an integer date column, so it might need some tweaking for a Date-type date column (especially the use of NA_integer_; I'm not sure what the underlying type is for Date objects, but I'm guessing NA_real_).

    df %>%
      mutate(
        event = as.logical(event),
        last_event = if_else(event, true = date, false = NA_integer_)) %>%
      fill(last_event) %>%
      mutate(event_age = date - last_event)
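
    For a Date-typed column, one option is to use `as.Date(NA)` as the missing value so both branches of `if_else()` have the same class. The sketch below assumes dplyr and tidyr are installed and uses the sample data from the loop answer; `res` and `age_prev` are names introduced here for illustration. Note that on event rows `event_age` is 0, because the row's own date is filled in; lagging `last_event` gives the days since the previous event instead.

```r
library(dplyr)
library(tidyr)

# Sample data taken from the for-loop answer.
df <- data.frame(
  date  = as.Date(c("2000-07-06", "2000-09-15", "2000-10-15", "2001-01-03",
                    "2001-03-17", "2001-05-23", "2001-08-26")),
  event = c(0, 0, 1, 0, 1, 1, 0)
)

res <- df %>%
  mutate(
    event = as.logical(event),
    # as.Date(NA) keeps both branches of if_else() in the Date class
    last_event = if_else(event, date, as.Date(NA))
  ) %>%
  fill(last_event) %>%                        # carry the last event date down
  mutate(
    event_age = as.numeric(date - last_event),      # 0 on event rows
    age_prev  = as.numeric(date - lag(last_event))  # days since the previous event
  )

res
```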
    
  • 2020-12-17 11:04

    Old question, but I was experimenting with rolling joins and found this interesting.

    library(data.table)
    setDT(df)
    setkey(df, date)
    
    # rolling self-join to attach last event time
    df <- df[event == 1, .(lastevent = date), keyby = date][df, roll = TRUE]
    
    # difference between each record's last event and the previous record's last event
    df[, tae := difftime(lastevent, shift(lastevent, 1L, type = "lag"), units = "days")]
    
    # for non-event rows, difference between the date and the joined previous event
    df[event == 0, tae := difftime(date, lastevent, units = "days")]
    
    > df
             date  lastevent event      tae
    1: 2000-07-06       <NA>     0  NA days
    2: 2000-09-15       <NA>     0  NA days
    3: 2000-10-15 2000-10-15     1  NA days
    4: 2001-01-03 2000-10-15     0  80 days
    5: 2001-03-17 2001-03-17     1 153 days
    6: 2001-05-23 2001-05-23     1  67 days
    7: 2001-08-26 2001-05-23     0  95 days
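
    To see what the rolling join does in isolation, here is a small sketch on made-up tables, assuming data.table is installed. The names `events`, `rows` and `res` are hypothetical. With `roll = TRUE`, each date in `rows` is matched to the last event date at or before it, and dates that precede every event get NA.

```r
library(data.table)

# Made-up lookup table of event dates; `lastevent` copies the join column
# so the matched event date survives the join as its own column.
events <- data.table(date = as.Date(c("2000-10-15", "2001-03-17")))
events[, lastevent := date]

# Made-up rows to look up.
rows <- data.table(date = as.Date(c("2000-09-15", "2001-01-03", "2001-03-17")))

# roll = TRUE carries the most recent event forward to each lookup date.
res <- events[rows, on = "date", roll = TRUE]
res
```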
    
  • 2020-12-17 11:05

    You could try something like this:

    # make an index of the latest events
    last_event_index <- cumsum(df$event) + 1
    
    # shift it by one to the right (drop the last element, prepend 1)
    last_event_index <- c(1, head(last_event_index, -1))
    
    # get the dates of the events and index the vector with the last_event_index, 
    # added an NA as the first date because there was no event
    last_event_date <- c(as.Date(NA), df[which(df$event==1), "date"])[last_event_index]
    
    # subtract the date of the last event from each row's date
    df$tae <- df$date - last_event_date
    df
    
    #        date event      tae
    #1 2000-07-06     0  NA days
    #2 2000-09-15     0  NA days
    #3 2000-10-15     1  NA days
    #4 2001-01-03     0  80 days
    #5 2001-03-17     1 153 days
    #6 2001-05-23     1  67 days
    #7 2001-08-26     0  95 days
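
    The same indexing idea can be wrapped in a small function. This is a sketch rather than part of the answer: `days_since_last_event` is a hypothetical helper name, and it assumes the rows are already sorted by date and that `event` is coded 0/1 with no missing values.

```r
# Hypothetical helper: days since the most recent earlier row with event == 1.
# Rows with no earlier event return NA.
days_since_last_event <- function(date, event) {
  stopifnot(inherits(date, "Date"), length(date) == length(event))
  # Number of events strictly before each row.
  n_prior <- cumsum(event) - event
  # Event dates, with a leading NA for rows that have no prior event.
  event_dates <- c(as.Date(NA), date[event == 1])
  as.numeric(date - event_dates[n_prior + 1])
}

# Sample data taken from the for-loop answer.
dates  <- as.Date(c("2000-07-06", "2000-09-15", "2000-10-15", "2001-01-03",
                    "2001-03-17", "2001-05-23", "2001-08-26"))
events <- c(0, 0, 1, 0, 1, 1, 0)

tae <- days_since_last_event(dates, events)
tae
# NA NA NA 80 153 67 95
```

    Subtracting `event` from the running count replaces the explicit shift: a row that is itself an event is not counted as its own predecessor.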
    