R, dplyr: cumulative version of n_distinct

I have a dataframe as follows. It is ordered by column time.

Input -

df = data.frame(time = 1:20,
            grp = sort(rep(1:5,4)),
             


        
4 Answers
  •  旧时难觅i
    2021-02-09 03:49

    Assuming the data is already ordered by time, first define a cumulative distinct-count function:

    dist_cum <- function(var)
      sapply(seq_along(var), function(x) length(unique(head(var, x))))
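
    As a quick check of what this function returns, here is a minimal sketch on a made-up vector (the values in x are hypothetical, not taken from the question). It also shows one possible vectorized alternative, cumsum(!duplicated(var)), which is not part of the original answer; the name dist_cum_fast is introduced here only for illustration.

    ```r
    # Cumulative distinct count, as defined in the answer.
    dist_cum <- function(var)
      sapply(seq_along(var), function(x) length(unique(head(var, x))))

    # Hypothetical input, not from the question.
    x <- c("a", "a", "b", "a", "c", "b")

    # Number of distinct values seen in the first 1, 2, ..., n elements.
    dist_cum(x)

    # Alternative sketch: a value raises the count only on its first
    # appearance, so a running sum of "not a duplicate" flags should
    # give the same numbers without rescanning each prefix.
    dist_cum_fast <- function(var) cumsum(!duplicated(var))

    identical(dist_cum(x), dist_cum_fast(x))
    ```

    The sapply version rescans a growing prefix at every position, so its cost grows roughly quadratically with the vector length; the duplicated-based version makes a single pass, which may matter for long groups.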
    

    Then a base R solution that uses ave to split var1 by grp and apply our function to each group (note: this assumes var1 is a factor):

    transform(df, var2=ave(as.integer(var1), grp, FUN=dist_cum))
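
    Since the question's full data frame is not shown here, the following is a sketch on a small made-up data frame (toy and its var1 values are hypothetical) to illustrate how the count restarts within each grp. The as.integer() call is presumably there because ave returns a vector of the same type as its first argument, so the factor is converted to integer codes before the counts are written back.

    ```r
    dist_cum <- function(var)
      sapply(seq_along(var), function(x) length(unique(head(var, x))))

    # Hypothetical data: two groups of four rows, already ordered by time.
    toy <- data.frame(
      time = 1:8,
      grp  = rep(1:2, each = 4),
      var1 = factor(c("a", "a", "b", "c", "x", "y", "y", "x"))
    )

    # ave() splits the integer codes of var1 by grp, applies dist_cum to
    # each piece, and puts the results back in the original row order.
    res <- transform(toy, var2 = ave(as.integer(var1), grp, FUN = dist_cum))
    res
    ```

    In this toy input the first group sees a, a, b, c and the second sees x, y, y, x, so var2 climbs within each group independently of the other.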
    

    A data.table solution, basically doing the same thing:

    library(data.table)
    (data.table(df)[, var2:=dist_cum(var1), by=grp])
    

    And dplyr, again, same thing:

    library(dplyr)
    df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
    
