Question
I need to create a counter variable depending on three other variables.
This is an extension of an earlier question. Consider multiple customers placing orders on Amazon. I want to count the successful orders for each user. When an order succeeds, the counter increases by one; when an order fails, the counter stays the same. The counter variable therefore depends on the time, the order status and the user.
Note that rows can share the same time but differ in order status. Such rows are not duplicates, because they differ in other columns.
library(data.table)
DT <- data.table(time = c(1,2,2,2,1,1,2,3,1,1), user = c(1,1,1,1,2,3,3,3,4,4), order_status = c('f','f','t','t','f','f','t','t','t','t'))
DT
The desired output is as follows, where 'output' is the counter variable.
    time user order_status output
 1:    1    1            f      0
 2:    2    1            f      0
 3:    2    1            t      1
 4:    2    1            t      1
 5:    1    2            f      0
 6:    1    3            f      0
 7:    2    3            t      1
 8:    3    3            t      2
 9:    1    4            t      1
10:    1    4            t      1
Answer 1:
The main challenge here is to set the first occurrence of every combination of time, user and order_status == 't' to 1. After that, it's a simple cumulative sum grouped by user.
Here are two ways to accomplish this using data.table:
Method 1:
DT[, id := 0L
][order_status == "t", id := c(1L, rep(0L, .N-1L)), by=names(DT)
][, id := cumsum(id), by=user]
The 2nd line here marks the first occurrence with 1, but only where order_status == "t".
A heavily commented production version of this code would look something like this:
DT[, id := 0L                        # set entire id col to 0
  ][order_status == "t",             # then, where order status is true
    id := c(1L, rep(0L, .N-1L)),     # set (or update) first value to 1
    by = names(DT)                   # for every time,user,order_status
  ][, id := cumsum(id),              # then, get cumulative sum of id
    by = user]                       # for every user
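To make the intermediate step visible, here is a minimal sketch that keeps the first-occurrence flag in its own column before taking the cumulative sum. The column name first_t is hypothetical (it is not part of the answer above), and the grouping columns are spelled out explicitly instead of being taken from names(DT).

```r
library(data.table)

DT <- data.table(time = c(1,2,2,2,1,1,2,3,1,1),
                 user = c(1,1,1,1,2,3,3,3,4,4),
                 order_status = c('f','f','t','t','f','f','t','t','t','t'))

# first_t is a hypothetical helper column: start with 0 everywhere
DT[, first_t := 0L]

# among successful rows only, flag the first row of each (time, user, order_status) group
DT[order_status == "t",
   first_t := c(1L, rep(0L, .N - 1L)),
   by = .(time, user, order_status)]

# running total of the flags per user gives the counter
DT[, id := cumsum(first_t), by = user]

print(DT)
```

Printing DT after the second step shows which rows actually contribute to the counter, which can help when checking the logic on real data.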
Method 2: Using data.table's join+update:
DT[, id := 0L
][DT, id := as.integer(order_status == "t"), mult="first", on=names(DT)
][, id := cumsum(id), by=user]
The 2nd step here does the same as in method 1, but it directly identifies the first occurrence and updates it to 1 if order_status == "t", by performing an update on a join-based subset. You can replace the DT on the inside with unique(DT) to remove redundancy.
If I had to choose, I'd say the 1st method is more efficient, since creating a rep() for each group should be quite fast compared with a join+update. But I find that the 2nd method makes the actual operation easier to understand, which I think matters more when you come back to your code several weeks later.
Answer 2:
A simple approach using data.table is:
DT[, output := cumsum(order_status == "t" & !duplicated(cbind(time, user, order_status))),
   by = .(user)]
    time user order_status output
 1:    1    1            f      0
 2:    2    1            f      0
 3:    2    1            t      1
 4:    2    1            t      1
 5:    1    2            f      0
 6:    1    3            f      0
 7:    2    3            t      1
 8:    3    3            t      2
 9:    1    4            t      1
10:    1    4            t      1
With this approach, any "f" row carries forward the count from the most recent "t" row of the same user. If you want all "f" rows to be 0 instead, that is easy enough as well: just change the by=... to by=.(user,order_status).
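The difference between the two groupings does not show up in the question's data, because no user there has a failed order after a successful one. A minimal sketch on hypothetical data (the table DT2 and the column names carry and zeroed are assumptions made for illustration) shows both behaviours side by side:

```r
library(data.table)

# Hypothetical data: user 1 succeeds at time 1, fails at time 2, succeeds at time 3
DT2 <- data.table(time = c(1, 2, 3),
                  user = c(1, 1, 1),
                  order_status = c("t", "f", "t"))

# grouped by user only: the failed row keeps the running count
DT2[, carry := cumsum(order_status == "t" &
                        !duplicated(cbind(time, user, order_status))),
    by = .(user)]

# grouped by user and order_status: failed rows are always 0
DT2[, zeroed := cumsum(order_status == "t" &
                         !duplicated(cbind(time, user, order_status))),
    by = .(user, order_status)]

print(DT2)
```

Here carry comes out as 1, 1, 2 and zeroed as 1, 0, 2, so the choice of by= depends on whether a failed order should display the user's running total or a zero.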
Answer 3:
The most readable way is probably a subquery.
library(data.table)
library(dplyr)
DT <- data.table(time=c(1,2,2,2,1,1,2,3,1,1),user=c(1,1,1,1,2,3,3,3,4,4), order_status=c('f','f','t','t','f','f','t','t','t','t'))
DT %>% left_join(
    DT %>%
      filter(order_status == "t") %>%
      group_by(user, time) %>%
      summarise() %>%
      arrange(time) %>%
      mutate(output = row_number()),
    by = c("user", "time")) %>%
  mutate(output = ifelse(is.na(output), 0, output))
NB: using tidyr, you can replace the last mutate with replace_na(list(output = 0)).
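As a hedged sketch of the replace_na variant: the version below is not the answer's exact pipeline. It uses a plain data frame named orders, a lookup table named counts (both hypothetical names), distinct() in place of group_by() plus summarise(), and it additionally joins on order_status. That last change is an assumption made so that a failed order sharing a user and time with a successful one (row 2 of the example) stays at 0, as in the desired output; joining on user and time alone would give that row a count of 1.

```r
library(dplyr)
library(tidyr)

orders <- data.frame(time = c(1,2,2,2,1,1,2,3,1,1),
                     user = c(1,1,1,1,2,3,3,3,4,4),
                     order_status = c('f','f','t','t','f','f','t','t','t','t'),
                     stringsAsFactors = FALSE)

# lookup table: one row per distinct successful (user, time), numbered within user
counts <- orders %>%
  filter(order_status == "t") %>%
  distinct(user, time, order_status) %>%
  group_by(user) %>%
  arrange(time, .by_group = TRUE) %>%
  mutate(output = row_number()) %>%
  ungroup()

# attach the counter to every row; rows without a match become 0
result <- orders %>%
  left_join(counts, by = c("user", "time", "order_status")) %>%
  replace_na(list(output = 0L))

print(result)
```

Unlike the data.table answers, this variant leaves every failed row at 0 rather than carrying the running count forward.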
Source: https://stackoverflow.com/questions/38900796/create-cumulative-counter-variable-per-user-with-multiple-conditions