Using cummean with group_by and ignoring NAs

Question

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

I'd like to add a new column to the above dataframe which takes the cumulative mean of the value column, not taking into account NAs. Is it possible to do this with dplyr? I've tried

df <- df %>% group_by(category) %>% mutate(new_col=cummean(value))

but cummean doesn't skip NAs: once it hits one, the remaining results in that group come back as NA.

EDIT: I do not want to count NAs as 0.
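
To make the NA behavior concrete, here is a minimal sketch on a plain vector. The vector x and the names na_result and clean_result are made up for illustration; the only dplyr function used is cummean, which takes a single vector and has no na.rm argument.

```r
library(dplyr)

# Hypothetical toy vector with one missing value in the middle
x <- c(2, NA, 4)

# The NA at position 2 carries into every later running mean
na_result <- cummean(x)
print(na_result)

# Dropping the NA first gives well-defined running means
clean_result <- cummean(x[!is.na(x)])
print(clean_result)
```

Dropping the NA up front changes the length of the vector, which is why it cannot be used directly inside mutate and why the answers below need a workaround.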

Answer 1:

You could use ifelse to treat NAs as 0 for the cummean call:

library(dplyr)

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

df %>%
  group_by(category) %>%
  mutate(new_col = cummean(ifelse(is.na(value), 0, value)))

Output:

# A tibble: 8 x 3
# Groups:   category [2]
  category value new_col
  <fct>    <dbl>   <dbl>
1 cat1       NA     0.  
2 cat1        2.    1.00
3 cat2        3.    3.00
4 cat1        4.    2.00
5 cat2        5.    4.00
6 cat2       NA     2.67
7 cat1        7.    3.25
8 cat2        8.    4.00

EDIT: Now I see this isn't the same as ignoring NAs.

Try this one instead. I group by a column which specifies if the value is NA or not, meaning cummean can run without encountering any NAs:

library(dplyr)

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

df %>%
  group_by(category, isna = is.na(value)) %>%
  mutate(new_col = ifelse(isna, NA, cummean(value)))

Output:

# A tibble: 8 x 4
# Groups:   category, isna [4]
  category value isna  new_col
  <fct>    <dbl> <lgl>   <dbl>
1 cat1       NA  TRUE    NA   
2 cat1        2. FALSE    2.00
3 cat2        3. FALSE    3.00
4 cat1        4. FALSE    3.00
5 cat2        5. FALSE    4.00
6 cat2       NA  TRUE    NA   
7 cat1        7. FALSE    4.33
8 cat2        8. FALSE    5.33
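
A possible alternative that avoids the extra grouping column is to build the running mean by hand: a cumulative sum with NAs counted as 0, divided by a cumulative count of the non-NA values. This is only a sketch; the helper name cummean_narm is hypothetical and not part of dplyr.

```r
library(dplyr)

df <- data.frame(category = c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value = c(NA, 2, 3, 4, 5, NA, 7, 8))

# Hypothetical helper: running mean over the non-NA values seen so far
cummean_narm <- function(x) {
  total <- cumsum(ifelse(is.na(x), 0, x))  # NAs add nothing to the running sum
  n     <- cumsum(!is.na(x))               # count only the non-NA observations
  total / n                                # 0 / 0 gives NaN before the first non-NA value
}

result <- df %>%
  group_by(category) %>%
  mutate(new_col = cummean_narm(value)) %>%
  ungroup()

print(result)
```

Note that this differs from the output above on the NA rows: a row with a missing value carries forward the running mean so far (or NaN if the group has had no non-NA value yet) instead of NA. If NA is wanted on those rows, wrapping the call as ifelse(is.na(value), NA, cummean_narm(value)) should give the same column as the grouped version.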

Answer 2:

One option is to remove the rows with NA values before calculating cummean, so those rows do not enter the calculation at all, and then join the result back onto the original rows. I'm not sure whether the OP wants NA values counted as 0 in the calculation.

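library(dplyr)

df %>% mutate(rn = row_number()) %>%   # remember the original row position
  filter(!is.na(value)) %>%            # drop NA rows so cummean never sees them
  group_by(category) %>%
  mutate(new_col = cummean(value)) %>%
  ungroup() %>%
  right_join(mutate(df, rn = row_number()), by="rn") %>%   # bring the NA rows back
  arrange(rn) %>%                      # restore the original order (join output order can differ between dplyr versions)
  select(category = category.y, value = value.y, new_col) %>%
  as.data.frame()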
#    category value  new_col
# 1     cat1    NA       NA
# 2     cat1     2 2.000000
# 3     cat2     3 3.000000
# 4     cat1     4 3.000000
# 5     cat2     5 4.000000
# 6     cat2    NA       NA
# 7     cat1     7 4.333333
# 8     cat2     8 5.333333

Source: https://stackoverflow.com/questions/49906503/using-cummean-with-group-by-and-ignoring-nas

Tags

dplyr