Side-by-side bar chart with columns proportional by group (relative frequency bar chart)

Question

The dataset

# Packages assumed by the code in this question
library(dplyr)    # %>%, count(), filter(), mutate()
library(ggplot2)

gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')
df <- data.frame(gender, answer)

is skewed towards females (12 of the 20 respondents):

df %>% ggplot(aes(gender, fill = gender)) + geom_bar()

My task is to build a graph that makes it easy to figure out which of the two genders is more likely to say 'Yes'.

However, given the imbalance, I cannot simply plot the raw counts

df %>% ggplot(aes(x = answer, fill = gender)) + geom_bar(position = 'dodge')

or even

# after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
df %>% ggplot(aes(x = answer, y = after_stat(count / sum(count)), fill = gender)) +
  geom_bar(position = 'dodge')

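Dividing by the overall total, as in the second attempt, only rescales every bar by the same constant, so the imbalance remains. To correct for it, I need to divide each count by the total number of males or females respectively, so that the 'Female' bars sum to 1, as do the 'Male' bars. Like so: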

# Total number of respondents per gender
df.total <- df %>% count(gender)
male.total <- (df.total %>% filter(gender == 'Male'))$n
female.total <- (df.total %>% filter(gender == 'Female'))$n

# Divide each (answer, gender) count by the total for that gender
df %>% count(answer, gender) %>%
  mutate(freq = n / if_else(gender == 'Male', male.total, female.total)) %>%
  ggplot(aes(x = answer, y = freq, fill = gender)) +
  geom_bar(stat = "identity", position = 'dodge')

This paints a completely different picture.
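
A quick numeric check, sketched here in base R, shows how the two normalisations differ. The names `tab`, `overall`, and `within_gender` are illustrative and do not come from the code above.

```r
gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female',
            'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No',
            'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')

# Cross-tabulate: rows are genders, columns are answers
tab <- table(gender, answer)

# Dividing by the overall total: all four cells together sum to 1
overall <- prop.table(tab)

# Dividing by each gender's total: each row sums to 1
within_gender <- prop.table(tab, margin = 1)

print(rowSums(overall))        # Female 0.6, Male 0.4: the imbalance is still there
print(rowSums(within_gender))  # Female 1, Male 1
print(within_gender)
```

Only the second table lets the 'Yes' shares of the two genders be compared directly.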

Questions:

1. Is there a way to simplify the code above using only dplyr and ggplot2?
2. Are there any other libraries that handle this better?
3. Does this type of chart have a conventional name?

Thanks.

Answer 1:

Question 1:

df %>%
  count(gender, answer) %>%
  group_by(gender) %>%            # sum(n) below is computed within each gender
  mutate(freq = n / sum(n)) %>%
  ggplot(aes(x = answer, y = freq, fill = gender)) +
  geom_bar(stat = "identity", position = 'dodge')

Question 2:

You can probably do it in fewer lines with other packages.
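
As one sketch of an alternative, base R alone (no extra packages) can draw a comparable side-by-side chart. This is not the answerer's suggestion, only an illustration; the name `rel_freq` is hypothetical.

```r
gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female',
            'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No',
            'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')

# Relative frequency of each answer within each gender (rows sum to 1)
rel_freq <- prop.table(table(gender, answer), margin = 1)

# With beside = TRUE, columns (answers) become the groups on the x axis
# and rows (genders) become the side-by-side bars within each group
barplot(rel_freq, beside = TRUE, legend.text = TRUE,
        xlab = "answer", ylab = "relative frequency within gender")
```

The trade-off is less control over styling than ggplot2 offers.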

Question 3:

It is a relative frequency bar graph, with the frequencies computed within each gender.

Answer 2:

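Given the data, the most effective way to determine whether men or women are more likely to answer "yes" to the question asked is to recode the answer as a binary variable and test the difference in proportions. Here a Welch t-test on the 0/1 variable is used for that purpose, so the group means it reports are the proportions of "yes".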

gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')
# Recode the answer as 1 (Yes) / 0 (No), so each group mean is a proportion of Yes
isYes <- ifelse(answer == "Yes", 1, 0)

t.test(isYes ~ gender)

...and the output:

> t.test(isYes ~ gender)

    Welch Two Sample t-test

data:  isYes by gender
t = -0.34659, df = 14.749, p-value = 0.7338
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.5965761  0.4299094
sample estimates:
mean in group Female   mean in group Male 
           0.4166667            0.5000000

The t.test() output reports the same proportions of "yes" as the relative frequency chart (about 0.42 for females and 0.50 for males), but the p-value of 0.73 means that we fail to reject the null hypothesis that men and women are equally likely to answer yes to the question asked. Failing to reject is not the same as showing that there is no difference, especially with only 20 observations.

Another way to interpret the t.test() output is that since 0 is within the 95% confidence interval of the difference of means, we fail to reject the null hypothesis that the means of the two groups are equal.
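
For reference, here is a sketch of the same comparison framed directly as a test on proportions, using prop.test() and fisher.test() from R's stats package instead of the t-test shown in this answer. The names `yes_counts`, `totals`, `pt`, and `ft` are illustrative. With counts this small, prop.test() is expected to warn that its chi-squared approximation may be unreliable, which is why Fisher's exact test is included as well.

```r
gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female',
            'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No',
            'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')

# Number of "Yes" answers and number of respondents per gender (Female first, then Male)
yes_counts <- tapply(answer == "Yes", gender, sum)
totals <- table(gender)

# Two-sample test of equal proportions; suppressWarnings() hides the
# small-sample warning, which is expected for counts this low
pt <- suppressWarnings(
  prop.test(x = as.vector(yes_counts), n = as.vector(totals))
)

# Fisher's exact test on the 2x2 table does not rely on a large-sample approximation
ft <- fisher.test(table(gender, answer))

print(pt$estimate)  # sample proportions of "Yes": Female first, then Male
print(pt$p.value)
print(ft$p.value)
```

On this data, Fisher's exact test likewise gives no evidence of a difference between the two genders.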

Answer 3:

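Setting position = "fill" in geom_bar is useful for seeing relative proportions, since it stacks the answers within each gender and rescales every bar to a total height of 1: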

library(ggplot2)

df <- data.frame(gender = c("Male", "Male", "Male", "Female", "Female", "Female", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Female", "Male", "Female", "Female", "Male", "Female", "Female"), 
                 answer = c("Yes", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes"),
                 stringsAsFactors = FALSE)

ggplot(df, aes(gender, fill = answer)) + geom_bar(position = 'fill')
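
As a small optional extension of the plot above, the y axis can be labelled in percent. This sketch assumes the scales package is installed (it is a dependency of ggplot2); the axis title and the name `p` are illustrative choices.

```r
library(ggplot2)

df <- data.frame(
  gender = c("Male", "Male", "Male", "Female", "Female", "Female", "Male", "Male", "Male", "Female",
             "Female", "Female", "Female", "Female", "Male", "Female", "Female", "Male", "Female", "Female"),
  answer = c("Yes", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No",
             "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Same filled bar chart, with the 0-1 scale shown as percentages
p <- ggplot(df, aes(gender, fill = answer)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "share of answers within gender")

print(p)
```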

Source: https://stackoverflow.com/questions/48434062/side-by-side-bar-chart-with-columns-proportional-by-group-relative-frequency-ba

Tags

ggplot2

dplyr

bar-chart