Create cumulative counter variable per-user, with multiple conditions

Question

I need to create a counter variable depending on three other variables.

This is an extension of an earlier question. Consider the situation where multiple consumers place orders on Amazon. I want to count the successful orders for each user. If an order is placed successfully, the counter increases by one; if the order fails, the counter stays the same. The counter therefore depends on the time, the order status, and the user.

Please also consider the case where the time is the same but the order status differs. Rows like these are not duplicates, because the data has other columns whose values differ.

DT <- data.table(time=c(1,2,2,2,1,1,2,3,1,1),user=c(1,1,1,1,2,3,3,3,4,4), order_status=c('f','f','t','t','f','f','t','t','t','t'))
DT

The desired output is as follows, where 'output' is the counter variable.

    time user order_status output
 1:    1    1            f      0
 2:    2    1            f      0
 3:    2    1            t      1
 4:    2    1            t      1
 5:    1    2            f      0
 6:    1    3            f      0
 7:    2    3            t      1
 8:    3    3            t      2
 9:    1    4            t      1
10:    1    4            t      1

Answer 1:

The main challenge here is to set the first occurrence of every combination of time, user, order_status=='t' to 1. Then it's a simple cumulative sum grouped by user.

Here are two ways to accomplish this using data.table:

Method 1:

DT[, id := 0L
  ][order_status == "t", id := c(1L, rep(0L, .N-1L)), by=names(DT)
   ][, id := cumsum(id), by=user]

The second line marks the first occurrence in each group with 1, and only where order_status == "t".

In production code, I would comment this heavily, something like:

DT[, id := 0L                       # set entire id col to 0
  ][order_status == "t",            # then, where order status is true
      id := c(1L, rep(0L, .N-1L)),  # set (or update) first value to 1
      by = names(DT)                # for every time,user,order_status
   ][, id := cumsum(id),            # then, get cumulative sum of id
       by = user]                   # for every user

Method 2: Using data.table's join+update:

DT[, id := 0L
  ][DT, id := as.integer(order_status == "t"), mult="first", on=names(DT)
   ][, id := cumsum(id), by=user]

The second step does the same as in Method 1, but it identifies the first occurrence directly and sets it to 1 if order_status == "t", by performing an update on a join-based subset. You can replace the inner DT with unique(DT) to remove the redundancy.

If I had to choose, I'd say the first method is more efficient, since creating a rep() for each group should be quite fast compared with a join+update. But I find that the second method makes the actual operation easier to recognize, which I think matters more when you come back to your code several weeks later.
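
Both methods group or join on names(DT), which works for the three-column sample. The question mentions that the real data has further columns that differ between otherwise identical rows; in that case, grouping on every column would treat such rows as distinct and count each of them. A minimal sketch of Method 1 with the key columns spelled out, assuming a hypothetical extra column `item` that is not in the original data:

```r
library(data.table)

DT <- data.table(
  time         = c(1, 2, 2, 2, 1, 1, 2, 3, 1, 1),
  user         = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4),
  order_status = c("f", "f", "t", "t", "f", "f", "t", "t", "t", "t"),
  item         = letters[1:10]  # hypothetical extra column, different on every row
)

# Start with no order marked.
DT[, id := 0L]

# Mark only the first row of each successful (time, user, order_status) group.
# Assumption: these three columns define what counts as "the same" order event.
DT[order_status == "t",
   id := c(1L, rep(0L, .N - 1L)),
   by = c("time", "user", "order_status")]

# Running total of the marks within each user.
DT[, id := cumsum(id), by = user]

print(DT)
```

On the sample data this reproduces the desired counter, even though `item` makes every row unique.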

Answer 2:

A simple approach using data.table is:

DT[,output := cumsum(order_status=="t" & !duplicated(cbind(time,user,order_status)))
   ,by=.(user)]

    time user order_status output
 1:    1    1            f      0
 2:    2    1            f      0
 3:    2    1            t      1
 4:    2    1            t      1
 5:    1    2            f      0
 6:    1    3            f      0
 7:    2    3            t      1
 8:    3    3            t      2
 9:    1    4            t      1
10:    1    4            t      1

With this approach, any "f" row that follows a successful order keeps the user's running count instead of dropping to 0. If you want every "f" row to be 0, that is easy enough as well: change by=.(user) to by=.(user,order_status).
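
The sample data never has a failed order after a successful one for the same user, so the two groupings give the same result there. A small sketch on made-up data (the `orders` table and the column names `by_user` and `by_user_status` are illustrative only) shows where they differ:

```r
library(data.table)

# Made-up data: one user whose failed order comes after a successful one.
orders <- data.table(
  time         = c(1, 2, 3),
  user         = c(1, 1, 1),
  order_status = c("t", "f", "t")
)

# Grouped by user only: the "f" row keeps the running count.
orders[, by_user := cumsum(order_status == "t" &
                             !duplicated(cbind(time, user, order_status))),
       by = .(user)]

# Grouped by user and status: the "f" rows form their own group and stay at 0.
orders[, by_user_status := cumsum(order_status == "t" &
                                    !duplicated(cbind(time, user, order_status))),
       by = .(user, order_status)]

print(orders)
```

Here by_user comes out as 1, 1, 2 and by_user_status as 1, 0, 2.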

Answer 3:

The most readable way is probably a subquery.

library(data.table)
library(dplyr)
DT <- data.table(time=c(1,2,2,2,1,1,2,3,1,1),user=c(1,1,1,1,2,3,3,3,4,4), order_status=c('f','f','t','t','f','f','t','t','t','t'))
DT %>% left_join(
  DT %>%
    filter(order_status == "t") %>%
    group_by(user, time) %>%
    summarise() %>%
    arrange(time) %>%
    mutate(output = row_number()),
  by = c("user", "time")) %>%
  mutate(output = ifelse(is.na(output), 0, output))

NB: with tidyr, you can replace the last mutate with replace_na(list(output = 0)).
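
A sketch of the replace_na() variant on the sample data. It uses a plain data.frame and distinct() in place of group_by() + summarise(); both are choices made for this sketch, not the answer's exact pipeline, and the names `orders`, `success_rank` and `result` are illustrative:

```r
library(dplyr)
library(tidyr)

orders <- data.frame(
  time         = c(1, 2, 2, 2, 1, 1, 2, 3, 1, 1),
  user         = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4),
  order_status = c("f", "f", "t", "t", "f", "f", "t", "t", "t", "t")
)

# One row per (user, time) that has at least one successful order,
# numbered in time order within each user.
success_rank <- orders %>%
  filter(order_status == "t") %>%
  distinct(user, time) %>%
  arrange(user, time) %>%
  group_by(user) %>%
  mutate(output = row_number()) %>%
  ungroup()

# Attach the count to every row; rows without a match get 0.
result <- orders %>%
  left_join(success_rank, by = c("user", "time")) %>%
  replace_na(list(output = 0L))

print(result)
```

Because the join is on user and time only, a failed order that shares a time with a successful one receives that time's count: row 2 (user 1, time 2, "f") comes out as 1 here, whereas the desired output in the question shows 0.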

Source: https://stackoverflow.com/questions/38900796/create-cumulative-counter-variable-per-user-with-multiple-conditions

Tags

data.table

dplyr

cumulative-sum