Using cummean with group_by and ignoring NAs

送分小仙女□ 提交于 2021-02-07 10:28:10

问题


df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

I'd like to add a new column to the above dataframe which takes the cumulative mean of the value column, not taking into account NAs. Is it possible to do this with dplyr? I've tried

df <- df %>% group_by(category) %>% mutate(new_col=cummean(value))

but cummean just doesn't know what to do with NAs.

EDIT: I do not want to count NAs as 0.


回答1:


You could use ifelse to treat NAs as 0 for the cummean call:

library(dplyr)

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

df %>%
  group_by(category) %>%
  mutate(new_col = cummean(ifelse(is.na(value), 0, value)))

Output:

# A tibble: 8 x 3
# Groups:   category [2]
  category value new_col
  <fct>    <dbl>   <dbl>
1 cat1       NA     0.  
2 cat1        2.    1.00
3 cat2        3.    3.00
4 cat1        4.    2.00
5 cat2        5.    4.00
6 cat2       NA     2.67
7 cat1        7.    3.25
8 cat2        8.    4.00

EDIT: Now I see this isn't the same as ignoring NAs.

Try this one instead. I group by a column which specifies if the value is NA or not, meaning cummean can run without encountering any NAs:

library(dplyr)

df <- data.frame(category=c("cat1","cat1","cat2","cat1","cat2","cat2","cat1","cat2"),
                 value=c(NA,2,3,4,5,NA,7,8))

df %>%
  group_by(category, isna = is.na(value)) %>%
  mutate(new_col = ifelse(isna, NA, cummean(value)))

Output:

# A tibble: 8 x 4
# Groups:   category, isna [4]
  category value isna  new_col
  <fct>    <dbl> <lgl>   <dbl>
1 cat1       NA  TRUE    NA   
2 cat1        2. FALSE    2.00
3 cat2        3. FALSE    3.00
4 cat1        4. FALSE    3.00
5 cat2        5. FALSE    4.00
6 cat2       NA  TRUE    NA   
7 cat1        7. FALSE    4.33
8 cat2        8. FALSE    5.33



回答2:


An option is to remove value before calculating cummean. In this method rows with NA value will not be accounted for cummean calculation. Not sure if OP wants to consider NA value as 0 in calculation.

df %>% mutate(rn = row_number()) %>%
  filter(!is.na(value)) %>%
  group_by(category) %>%
  mutate(new_col = cummean(value)) %>%
  ungroup() %>% 
  right_join(mutate(df, rn = row_number()), by="rn") %>%
  select(category = category.y, value = value.y, new_col) %>%
  as.data.frame()
#    category value  new_col
# 1     cat1    NA       NA
# 2     cat1     2 2.000000
# 3     cat2     3 3.000000
# 4     cat1     4 3.000000
# 5     cat2     5 4.000000
# 6     cat2    NA       NA
# 7     cat1     7 4.333333
# 8     cat2     8 5.333333


来源:https://stackoverflow.com/questions/49906503/using-cummean-with-group-by-and-ignoring-nas

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!