Count distinct by group - moving window

Submitted by 橙三吉 on 2020-05-08 15:38:11

Question


Let's say I have a dataset containing visits in a hospital. My goal is to generate a variable that counts the number of unique patients the visitor has seen up to and including the date of each visit. I often work with group_by from dplyr, but this seems a little tricky. I guess I would have to use group_by, n_distinct, and sum, or some kind of moving-window command. The "goal" variable is what I need.

visitor visitdt patient goal
125469  1/12/2018   15200   1
125469  1/19/2018   15200   1
125469  2/16/2018   15200   1
125469  2/23/2018   52607   2
125469  3/9/2018    52607   2
125469  3/16/2018   52607   2
125469  3/23/2018   15200   2
125469  3/29/2018   15200   2
125469  3/30/2018   20589   3
125469  4/6/2018    20589   3

Thanks, Marvin


Answer 1:


You can do:

with(df, ave(patient, visitor, FUN = function(x) cumsum(!duplicated(x))))

 [1] 1 1 1 2 2 2 2 2 3 3

Essentially, it is a cumulative sum of non-duplicated values per group.

And you can also do the same with dplyr:

library(dplyr)
df %>%
  group_by(visitor) %>%
  mutate(res = cumsum(!duplicated(patient)))



Answer 2:


We can use dplyr

library(dplyr)
df1 %>%
  group_by(visitor) %>%
  mutate(goal = cummax(match(patient, unique(patient))))
  # or, with a factor:
  # mutate(goal1 = cummax(as.integer(factor(patient, levels = unique(patient)))))

# A tibble: 10 x 4
# Groups:   visitor [1]
#   visitor visitdt   patient  goal
#     <int> <chr>       <int> <int>
# 1  125469 1/12/2018   15200     1
# 2  125469 1/19/2018   15200     1
# 3  125469 2/16/2018   15200     1
# 4  125469 2/23/2018   52607     2
# 5  125469 3/9/2018    52607     2
# 6  125469 3/16/2018   52607     2
# 7  125469 3/23/2018   15200     2
# 8  125469 3/29/2018   15200     2
# 9  125469 3/30/2018   20589     3
#10  125469 4/6/2018    20589     3

data

df1 <- structure(list(visitor = c(125469L, 125469L, 125469L, 125469L, 
125469L, 125469L, 125469L, 125469L, 125469L, 125469L), visitdt = c("1/12/2018", 
"1/19/2018", "2/16/2018", "2/23/2018", "3/9/2018", "3/16/2018", 
"3/23/2018", "3/29/2018", "3/30/2018", "4/6/2018"), patient = c(15200L, 
15200L, 15200L, 52607L, 52607L, 52607L, 15200L, 15200L, 20589L, 
20589L), goal = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L)),
class = "data.frame", row.names = c(NA, 
-10L))
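The `cummax(match(...))` idiom can be unpacked the same way. The sketch below uses base R only, with the `patient` vector taken from the data above; `first_seen_rank` is a hypothetical name. `match(patient, unique(patient))` numbers each patient by order of first appearance, and `cummax` then keeps that number from dropping when an earlier patient comes back. Like the `cumsum` approach, it assumes the rows are in visit-date order.

```r
# Patient IDs for one visitor, in visit order (copied from the data above).
patient <- c(15200, 15200, 15200, 52607, 52607, 52607, 15200, 15200, 20589, 20589)

# Position of each patient among the distinct patients, in order of first appearance.
# A returning patient gets its original (smaller) position back.
first_seen_rank <- match(patient, unique(patient))
print(first_seen_rank)

# The running maximum of those positions is the number of distinct patients seen so far.
goal <- cummax(first_seen_rank)
print(goal)
```

Without `cummax`, rows 7 and 8 (patient 15200 again) would report 1 rather than 2.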



Answer 3:


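What you are tracking sounds important. Another option is a data.table non-equi self-join, followed by an update by reference: for each row, the join collects all rows of the same visitor with a visit date on or before that row's date, and counts the distinct patients among them: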

DT[, goal2 :=
    DT[.SD, on=.(visitor, visitdt<=visitdt), allow.cartesian=TRUE, 
        length(unique(patient)), by=.EACHI]$V1]

output:

    visitor    visitdt patient goal goal2
 1:  125469 2018-01-12   15200    1     1
 2:  125469 2018-01-19   15200    1     1
 3:  125469 2018-02-16   15200    1     1
 4:  125469 2018-02-23   52607    2     2
 5:  125469 2018-03-09   52607    2     2
 6:  125469 2018-03-16   52607    2     2
 7:  125469 2018-03-23   15200    2     2
 8:  125469 2018-03-29   15200    2     2
 9:  125469 2018-03-30   20589    3     3
10:  125469 2018-04-06   20589    3     3

data:

library(data.table)
DT <- fread("visitor visitdt patient goal
125469  1/12/2018   15200   1
125469  1/19/2018   15200   1
125469  2/16/2018   15200   1
125469  2/23/2018   52607   2
125469  3/9/2018    52607   2
125469  3/16/2018   52607   2
125469  3/23/2018   15200   2
125469  3/29/2018   15200   2
125469  3/30/2018   20589   3
125469  4/6/2018    20589   3")
DT[, visitdt := as.Date(visitdt, "%m/%d/%Y")]
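For readers who do not use data.table, the logic of the non-equi join can be written out in base R. This is a rough sketch of the same idea, not the data.table implementation; the data frame `visits` and the helper `count_seen` are hypothetical names introduced here. For every row it selects the same visitor's rows dated on or before that row and counts the distinct patients, which makes it quadratic in the number of rows and mainly useful for illustration.

```r
# Same data as in the question, with visit dates parsed as Date.
visits <- data.frame(
  visitor = rep(125469L, 10),
  visitdt = as.Date(c("1/12/2018", "1/19/2018", "2/16/2018", "2/23/2018",
                      "3/9/2018", "3/16/2018", "3/23/2018", "3/29/2018",
                      "3/30/2018", "4/6/2018"), format = "%m/%d/%Y"),
  patient = c(15200L, 15200L, 15200L, 52607L, 52607L,
              52607L, 15200L, 15200L, 20589L, 20589L)
)

# Hypothetical helper: for each row, count distinct patients among the
# same visitor's visits dated on or before that row's date.
count_seen <- function(d) {
  vapply(seq_len(nrow(d)), function(i) {
    up_to_now <- d$visitor == d$visitor[i] & d$visitdt <= d$visitdt[i]
    length(unique(d$patient[up_to_now]))
  }, integer(1))
}

visits$goal2 <- count_seen(visits)
print(visits)

# Because rows are compared by date rather than by position,
# the result does not depend on how the rows are ordered.
shuffled <- visits[rev(seq_len(nrow(visits))), ]
shuffled$goal3 <- count_seen(shuffled)
```

Unlike the `cumsum` and `cummax` approaches, a date-based comparison of this kind does not require the data to be sorted first.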


Source: https://stackoverflow.com/questions/58222809/count-distinct-by-group-moving-window
