dplyr: Find mean for each bin by groups

Anonymous (unverified), submitted 2019-12-03 02:29:01

Question:

I am trying to understand dplyr. I am splitting the values in my data frame by group, bin, and sign, and I want the mean value for each group/bin/sign combination. I would like to output a data frame with the mean and the count for each group/bin/sign combination, plus the total count for each group. I think I have it, but sometimes the values I get in base R differ from the dplyr output. Am I doing this correctly? It is also very contorted. Is there a more direct way?

library(ggplot2)
df <- data.frame(
  id = sample(LETTERS[1:3], 100, replace=TRUE),
  tobin = rnorm(1000),
  value = rnorm(1000)
)
df$tobin[sample(nrow(df), 10)] = 0

df$bin = cut_interval(abs(df$tobin), length=1)
df$sign = ifelse(df$tobin==0, "NULL", ifelse(df$tobin>0, "-", "+"))

# Find mean of value by group, bin, and sign using dplyr
library(dplyr)
res <- df %>% group_by(id, bin, sign) %>%
  summarise(Num = length(bin), value=mean(value,na.rm=TRUE))

res %>% group_by(id) %>%
  summarise(total= sum(Num))
res=data.frame(res)
total=data.frame(total)
res$total = total[match(res$id, total$id),"total"]

res[res$id=="A" & res$bin=="[0,1]" & res$sign=="NULL",]

# Check in base R if mean by group, bin, and sign is correct
# Sometimes not?
groupA = df[df$id=="A" & df$bin=="[0,1]" & df$sign=="NULL",]
mean(groupA$value, na.rm=T)

I am going crazy because it doesn't work on my data, and this command just repeats the mean of the whole dataset:

ddply(df, .(id, bin, sign), summarize, mean = mean(value,na.rm=TRUE)) 

Here every row's mean is equal to mean(value,na.rm=TRUE) over the whole data frame, so the grouping is ignored completely. All the grouping columns are factors, and value is numeric.

This however works:

with(df, aggregate(df$value, by = list(id, bin, sign), FUN = function(x) c(mean(x)))) 

Please help me..

Answer 1:

You seem to be flailing a bit. You've got correct code, then you've got extra code.

Start from a fresh R session, define your data, then run:

library(dplyr)
res <- df %>% group_by(id, bin, sign) %>%
  summarise(Num = n(), value = mean(value,na.rm=TRUE))

The above code is from your question, but I replaced length(bin) with the built-in dplyr::n() function. The above code accurately gives the group-wise averages:

head(res)
#   id   bin sign Num       value
# 1  A [0,1]    - 122 -0.08330338
# 2  A [0,1]    + 111  0.11394381
# 3  A [0,1] NULL   2  0.75232462
# 4  A (1,2]    -  54 -0.09236725
# 5  A (1,2]    +  45  0.20581095
# 6  A (2,3]    -  12 -0.08998771

Jumping ahead to the last couple of lines in your code block:

groupA = df[df$id=="A" & df$bin=="[0,1]" & df$sign=="NULL", ]
mean(groupA$value, na.rm=TRUE)
# [1] 0.7523246

This matches the third row of the results above. So you did it, it works fine!
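If checking one group by hand feels fragile, the same comparison can be run over every group at once. The following is only a sketch under some assumptions: the data is a simplified stand-in for the df in the question, the seed is added for reproducibility, base cut() with hand-picked breaks replaces ggplot2's cut_interval(), and the names by_dplyr, by_base, and both are made up for illustration.

```r
library(dplyr)

set.seed(42)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(
  id    = sample(LETTERS[1:3], 1000, replace = TRUE),
  tobin = rnorm(1000),
  value = rnorm(1000)
)
df$tobin[sample(nrow(df), 10)] <- 0
# Base-R stand-in for cut_interval(); the breaks are an assumption
df$bin  <- cut(abs(df$tobin), breaks = c(0, 1, 2, Inf), include.lowest = TRUE)
# Labels here follow the numeric sign of tobin
df$sign <- ifelse(df$tobin == 0, "NULL", ifelse(df$tobin > 0, "+", "-"))

# Group means via dplyr
by_dplyr <- df %>%
  group_by(id, bin, sign) %>%
  summarise(value = mean(value, na.rm = TRUE)) %>%
  ungroup()

# Group means via base R
by_base <- aggregate(value ~ id + bin + sign, data = df, FUN = mean)

# Line the two results up by group and compare the means
both <- merge(as.data.frame(by_dplyr), by_base,
              by = c("id", "bin", "sign"),
              suffixes = c("_dplyr", "_base"))
all.equal(both$value_dplyr, both$value_base)
```

If the two approaches agree, all.equal() returns TRUE; anything else points to a group where they differ.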

The rest of your code is confused:

res %>% group_by(id) %>%
  summarise(total= sum(Num))

I'm not sure what you're trying to accomplish with this, but you don't assign it to anything, so it is run but not saved. The later total=data.frame(total) line then has no total object to work with.
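If the goal was to attach the total count per id to each row, one possible approach is to regroup by id and use mutate(), which avoids the separate total table and the match() step. This is a sketch, not necessarily what the asker intended: the data is a simplified stand-in for the question's df, the seed and the cut() breaks are assumptions, and only the total column name is taken from the question.

```r
library(dplyr)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(
  id    = sample(LETTERS[1:3], 1000, replace = TRUE),
  tobin = rnorm(1000),
  value = rnorm(1000)
)
df$tobin[sample(nrow(df), 10)] <- 0
# Base-R stand-in for cut_interval(); the breaks are an assumption
df$bin  <- cut(abs(df$tobin), breaks = c(0, 1, 2, Inf), include.lowest = TRUE)
# Labels here follow the numeric sign of tobin
df$sign <- ifelse(df$tobin == 0, "NULL", ifelse(df$tobin > 0, "+", "-"))

res <- df %>%
  group_by(id, bin, sign) %>%
  summarise(Num = n(), value = mean(value, na.rm = TRUE)) %>%
  group_by(id) %>%                # regroup by id only
  mutate(total = sum(Num)) %>%    # per-id total, repeated on each row of that id
  ungroup()

res <- as.data.frame(res)
head(res)
```

mutate() keeps one row per group/bin/sign combination, whereas summarise() would collapse the result to one row per id.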

As for your ddply attempt:

ddply(df, .(id, bin, sign), summarize, mean = mean(value,na.rm=TRUE)) 

You'll notice that if you have dplyr loaded and then load the plyr library, there's a message that:

You have loaded plyr after dplyr - this is likely to cause problems. If you need functions from both plyr and dplyr, please load plyr first, then dplyr: library(plyr); library(dplyr)

Do not ignore this warning! My guess is this happened, you ignored it, and that's part of the source of your troubles. Probably you don't need plyr at all, but if you do, load it before dplyr!
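One likely mechanism behind the "mean of the whole dataset" symptom is name masking: with plyr attached after dplyr, an unqualified summarise() can resolve to plyr's version, which does not honour dplyr grouping. A defensive option is to namespace-qualify the calls so the load order no longer matters. The sketch below uses a tiny made-up data frame, not the question's data.

```r
# No library() calls: every verb is namespace-qualified, so it does not
# matter which packages are attached or in which order.
df <- data.frame(
  id    = rep(c("A", "B"), each = 3),   # made-up illustration data
  value = c(1, 2, 3, 10, 20, 30)
)

grouped <- dplyr::group_by(df, id)
res <- dplyr::summarise(grouped, mean = mean(value, na.rm = TRUE))
res <- as.data.frame(res)
res
#   id mean
# 1  A    2
# 2  B   20
```

The same idea applies in the other direction: plyr::ddply() and plyr::summarise() can be called with their prefix if plyr is genuinely needed.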
