Using R & dplyr to summarize - group_by, count, mean, sd [closed]

Posted by ♀尐吖头ヾ on 2019-12-12 17:18:53

Question


Good day and greetings! This is my first post on Stack Overflow. I am fairly new to R and even newer to dplyr. I have a small data set with two columns, var1 and var2. The var1 column holds numeric values. The var2 column is a factor with three levels: A, B, and C.

        var1 var2
1  1.4395244    A
2  1.7698225    A
3  3.5587083    A
4  2.0705084    A
5  2.1292877    A
6  3.7150650    B
7  2.4609162    B
8  0.7349388    B
9  1.3131471    B
10 1.5543380    B
11 3.2240818    C
12 2.3598138    C
13 2.4007715    C
14 2.1106827    C
15 1.4441589    C

'data.frame':   15 obs. of  2 variables:
 $ var1: num  1.44 1.77 3.56 2.07 2.13 ...
 $ var2: Factor w/ 3 levels "A","B","C": 1 1 1 1 1 2 2 2 2 2 ...

I am trying to use dplyr to group by var2 (A, B, and C), then count the rows and summarize var1 by mean and sd. The count works, but instead of the mean and sd for each group, I get the overall mean and sd next to each group.

To try to resolve the issue, I have conducted multiple internet searches. All results seem to offer a similar syntax to the one I am using. I have also read through all of the recommended posts that Stack Overflow offered prior to posting. Also, I tried restarting R and I made sure that I am not using plyr.

Here is the code that I used to create the data set and the dplyr group_by / summarize.

library(dplyr)
set.seed(123)
var1 <- rnorm(15, mean=2, sd=1)
var2 <- c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B",
          "C", "C", "C", "C", "C")
df <- data.frame(var1, var2)
df

df %>%
  group_by(df$var2) %>%
  summarize(
    count = n(),
    mean = mean(df$var1, na.rm = TRUE),
    sd = sd(df$var1, na.rm = TRUE)
  )

Here are the results:

# A tibble: 3 x 4
  `df$var2` count  mean    sd
  <fct>     <int> <dbl> <dbl>
1 A             5  2.15 0.845
2 B             5  2.15 0.845
3 C             5  2.15 0.845

The count appears to work, showing 5 for each group. However, each group shows the mean and sd of the whole column rather than of that group. The expected result is the count, mean, and sd for each group.

I am sure I am overlooking something obvious but I would greatly appreciate any assistance.

Thanks!


Answer 1:


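Even though this was answered in the comments, I felt that such a nice reproducible example in a very first question deserved an official answer. The fix is to refer to the columns by bare name (var1, var2) inside group_by() and summarize(). Writing df$var1 pulls the whole column from the original data frame, so the grouping is ignored.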

library(dplyr)
set.seed(123)
var1 <- rnorm(15, mean=2, sd=1)
var2 <- c(rep("A", 5), rep("B", 5), rep("C", 5))
df <- data.frame(var1, var2)
# Bare column names (var1, var2) are evaluated within each group
df_stat <- df %>%
  group_by(var2) %>%
  summarize(
    count = n(),
    mean = mean(var1, na.rm = TRUE),
    sd = sd(var1, na.rm = TRUE)
  )
head(df_stat)
# # A tibble: 3 x 4
#   var2  count  mean    sd
#   <fct> <int> <dbl> <dbl>
# 1 A         5  2.19 0.811
# 2 B         5  1.96 1.16
# 3 C         5  2.31 0.639
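To see why the two spellings differ, it may help to run them side by side. The following is a minimal sketch, not part of the original answer: the names per_group, overall, and base_means are hypothetical, and the tapply() call is only an independent base R cross-check. Inside summarize(), a bare var1 is looked up in the current group's rows, whereas df$var1 is ordinary R extraction from the full data frame.

```r
library(dplyr)

# Same data as in the question (seed 123, 15 draws, 5 rows per group).
set.seed(123)
df <- data.frame(
  var1 = rnorm(15, mean = 2, sd = 1),
  var2 = rep(c("A", "B", "C"), each = 5)
)

# Bare column name: evaluated per group, so each group gets its own mean.
per_group <- df %>%
  group_by(var2) %>%
  summarize(mean = mean(var1))

# df$var1: extracts all 15 values from the original data frame,
# so every group is assigned the same overall mean.
overall <- df %>%
  group_by(var2) %>%
  summarize(mean = mean(df$var1))

# Base R cross-check of the per-group means (hypothetical helper name).
base_means <- tapply(df$var1, df$var2, mean)

print(per_group)
print(overall)
print(base_means)
```

If this runs as expected, per_group should agree with base_means, while overall repeats mean(df$var1) on every row, which is the behaviour described in the question.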


Source: https://stackoverflow.com/questions/57194260/using-r-dplyr-to-summarize-group-by-count-mean-sd
