Question
The dataset
gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')
df <- data.frame(gender, answer)
is biased towards females:
df %>% ggplot(aes(gender, fill = gender)) + geom_bar()
My task is to build a graph that makes it easy to see which of the two genders is more likely to say 'Yes'.
But, given the bias, I cannot just do
df %>% ggplot(aes(x = answer, fill = gender)) + geom_bar(position = 'dodge')
or even
df %>% ggplot(aes(x = answer, y = ..count../sum(..count..), fill = gender)) +
geom_bar(position = 'dodge')
To alleviate the bias I need to divide each count by the total number of males or females, respectively, so that the 'Female' bars add up to 1, as do the 'Male' ones. Like so:
df.total <- df %>% count(gender)
male.total <- (df.total %>% filter(gender == 'Male'))$n
female.total <- (df.total %>% filter(gender == 'Female'))$n
df %>% count(answer, gender) %>%
mutate(freq = n/if_else(gender == 'Male', male.total, female.total)) %>%
ggplot(aes(x = answer, y = freq, fill = gender)) +
geom_bar(stat="identity", position = 'dodge')
Which draws a completely different picture.
Questions:
- Is there a way to simplify the code above using only dplyr and ggplot2?
- Are there any other libraries that can do the trick better?
- Does the above type of chart have a conventional name?
Thanks.
Answer 1:
Question 1:
df %>%
count(gender, answer) %>%
group_by(gender) %>%
mutate(freq = n/sum(n)) %>%
ggplot(aes(x = answer, y = freq, fill = gender)) +
geom_bar(stat="identity", position = 'dodge')
Question 2:
You can probably do it in fewer lines with other packages.
Question 3:
Relative frequency bar graph.
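As a base-R cross-check on Question 1, the same within-gender proportions can be computed with table() and prop.table(). This is a minimal sketch rather than part of the original answer; the names tab and props are illustrative.

```r
gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')

# Cross-tabulate: rows are genders, columns are answers
tab <- table(gender, answer)

# margin = 1 divides each row by its own total, so each gender's row sums to 1
props <- prop.table(tab, margin = 1)
print(round(props, 3))

# beside = TRUE draws the genders side by side within each answer
barplot(props, beside = TRUE, legend.text = TRUE,
        xlab = "answer", ylab = "proportion within gender")
```

The 'Yes' column of props should agree with the freq values computed by the dplyr pipeline in Question 1.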
Answer 2:
Given the data, the most effective way to determine whether men or women are more likely to answer "yes" to the question asked is to convert the answers to a binary variable and test the difference in proportions, here with a t-test on the 0/1 indicator.
gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')
isYes <- ifelse(answer=="Yes",1,0)
t.test(isYes ~ gender)
...and the output:
> t.test(isYes ~ gender)
Welch Two Sample t-test
data: isYes by gender
t = -0.34659, df = 14.749, p-value = 0.7338
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5965761 0.4299094
sample estimates:
mean in group Female   mean in group Male
           0.4166667            0.5000000
The t.test() output gives the same proportions of "yes" as the weighted frequency chart (about 0.42 for females and 0.50 for males), but the p-value of 0.7338 means we fail to reject the null hypothesis that men and women are equally likely to answer "yes" to the question asked.
Another way to interpret the t.test() output is that since 0 is within the 95% confidence interval of the difference of means, we fail to reject the null hypothesis that the means of the two groups are equal.
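For comparison, R also ships tests aimed directly at two proportions. The sketch below is an illustration, not part of the original answer: it assumes prop.test() and fisher.test() from the stats package are acceptable alternatives to the t-test, and the names yes, successes and totals are made up here. With only 20 observations prop.test() warns that its chi-squared approximation may be inaccurate, so the exact test is probably the safer read.

```r
gender <- c('Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Female')
answer <- c('Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes')

# Count "Yes" answers and group sizes per gender
yes <- answer == "Yes"
successes <- c(Female = sum(yes[gender == "Female"]), Male = sum(yes[gender == "Male"]))
totals <- c(Female = sum(gender == "Female"), Male = sum(gender == "Male"))

# Two-sample test of equal proportions; the small-sample warning about the
# chi-squared approximation is suppressed here on purpose
pt <- suppressWarnings(prop.test(successes, totals))
print(pt$estimate)  # proportion of "Yes" in each group
print(pt$p.value)

# Fisher's exact test on the 2x2 gender-by-answer table
ft <- fisher.test(table(gender, answer))
print(ft$p.value)
```

Both tests address the same null hypothesis as the t-test above, namely that the proportion of "yes" is equal in the two groups.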
Answer 3:
position = "fill" in geom_bar is useful for seeing relative proportions:
library(ggplot2)
df <- data.frame(gender = c("Male", "Male", "Male", "Female", "Female", "Female", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Female", "Male", "Female", "Female", "Male", "Female", "Female"),
answer = c("Yes", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes"),
stringsAsFactors = FALSE)
ggplot(df, aes(gender, fill = answer)) + geom_bar(position = 'fill')
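If the filled bars are easier to read with a percentage axis, a small variation is sketched below. It assumes the scales package is installed (it normally comes along with ggplot2); the axis title is only a suggestion.

```r
library(ggplot2)

df <- data.frame(gender = c("Male", "Male", "Male", "Female", "Female", "Female", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Female", "Male", "Female", "Female", "Male", "Female", "Female"),
                 answer = c("Yes", "No", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes"),
                 stringsAsFactors = FALSE)

# position = "fill" stacks each gender's answers to a total height of 1;
# scales::percent relabels the 0-1 axis as percentages
p <- ggplot(df, aes(gender, fill = answer)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "share of answers within gender")
print(p)
```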
Source: https://stackoverflow.com/questions/48434062/side-by-side-bar-chart-with-columns-proportional-by-group-relative-frequency-ba