dplyr idiom for summarize() a filtered-group-by, and also replace any NAs due to missing rows

柔情痞子 提交于 2019-12-03 16:25:11

I don't think this has anything to do with the feature you've linked under comments (because IIUC that feature has to do with unused factor levels). Once you filter your data, IMO summarise should not (or rather can't?) be including them in the results (with the exception of factors). You should clarify this with the developers on their project page.


I'm by no means a dplyr expert, but I think, firstly, it'd be better to filter first followed by group_by + summarise. Else, you'll be filtering for each group, which is unnecessary. That is:

df.tmp <- df %.% filter(Week>=5 & Week<=43) %.% group_by(S,D,Y) %.% ...

This is just so that you're aware of it for any future cases.


IMO, it's better to use mutate here instead of summarise, as it'll remove the need for left_join, IIUC. That is:

df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
             md_X = median(X[Week >=5 & Week <= 43]), 
             mn_X = mean(X[Week >=5 & Week <= 43]))

Here, still we've the issue of replacing the NA/NaN. There's no easy/direct way to sub-assign here. So, you'll have to use ifelse, once again IIUC. But that'd be a little nicer if mutate supports expressions.

What I've in mind is something like:

df.tmp <- df %.% group_by(S,D,Y) %.% mutate(
              { tmp = Week >= 5 & Week <= 43;
                md_X = ifelse(length(tmp), median(X[tmp]), 0), 
                md_Y = ifelse(length(tmp), mean(X[tmp]), 0)
              })   

So, we'll have to workaround in this manner probably:

df.tmp = df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43)
df.tmp %.% mutate(md_X = ifelse(tmp[1L], median(X), 0), 
                  mn_X = ifelse(tmp[1L], mean(X), 0))

Or to put things together:

df %.% group_by(S,D,Y) %.% mutate(tmp = Week >=5 & Week <= 43, 
       md_X = ifelse(tmp[1L], median(X), 0), 
       mn_X = ifelse(tmp[1L], median(X), 0)) 

#     S  D    Y Week        X   tmp     md_X     mn_X
# 1  10 20 2005    6 22107.73  TRUE 22107.73 22107.73
# 2  10 23 2005   32 18751.98  TRUE 18751.98 18751.98
# 3  10 25 2005   33 31027.90  TRUE 31027.90 31027.90
# 4  10 26 2005    0 46586.33 FALSE     0.00     0.00
# 5  11 20 2006   12 43253.80  TRUE 43253.80 43253.80
# 6  11 22 2006   27 28243.66  TRUE 28243.66 28243.66
# 7  11 23 2006   36 20607.47  TRUE 20607.47 20607.47
# 8  11 24 2006   28 22186.89  TRUE 22186.89 22186.89
# 9  11 25 2006   15 30292.27  TRUE 30292.27 30292.27
# 10 12 20 2007   15 40386.83  TRUE 40386.83 40386.83
# 11 12 21 2007   44 18049.92 FALSE     0.00     0.00
# 12 12 26 2007   16 35856.24  TRUE 35856.24 35856.24

which doesn't require df.tmp.

HTH

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!