Plotting binned data using sum instead of count

Submitted by 霸气de小男生 on 2019-12-01 10:49:14

Question


I've tried to search for an answer, but can't seem to find the right one that does the job for me.

I have a dataset (data) with two variables: people's ages (age) and their number of awards (awards).

My objective is to plot the number of awards against age in R. FYI, a person can have multiple awards and people can have the same age.

I tried to plot a histogram and barplot, but the problem with that is that it counts the number of observations instead of summing the number of awards.

A sample dataset:

age <- c(21,22,22,25,30,34,45,26,37,46,49,21)
awards <- c(0,3,2,1,0,0,1,3,1,1,1,1)
data <- data.frame(cbind(age,awards))

What I'm looking for is a histogram (or barplot) that represents this data.

Ideally, I'd want the ages to be split into age groups. For example, 20-30, 31-40, 41-50 and then the total number of awards for each group.

The age group would be on the x-axis and the total number of awards for each age group would be on the y-axis.

Thanks!


Answer 1:


We can use the aggregate function and then use the ggplot2 package. I don't make too many barplots in base R these days so I'm not sure of the best way to do it without loading ggplot2:

create sample data

#data
set.seed(123)
dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
                  awards = rpois(200, 3))
head(dat)
  age awards
1  28      2
2  44      6
3  32      3
4  47      3
5  49      2
6  21      5

By age

#aggregate

sum_by_age <- aggregate(awards ~ age, data = dat, FUN = sum)

library(ggplot2)

ggplot(sum_by_age, aes(x = age, y = awards))+
    geom_bar(stat = 'identity')
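
As a quick sanity check, the same aggregation can be run on the small sample from the question, where the totals are easy to verify by hand. This is only a sketch; the names `small` and `small_by_age` are introduced here for illustration and are not part of the original answer.

```r
# Sample data from the question (the name `small` is illustrative)
small <- data.frame(
  age    = c(21, 22, 22, 25, 30, 34, 45, 26, 37, 46, 49, 21),
  awards = c(0, 3, 2, 1, 0, 0, 1, 3, 1, 1, 1, 1)
)

# Sum the awards within each distinct age
small_by_age <- aggregate(awards ~ age, data = small, FUN = sum)

# A count would report two people aged 22; the sum reports their awards
small_by_age[small_by_age$age == 22, ]
```

The row for age 22 shows the difference the question is about: a histogram would draw a bar of height 2 (two people), while the aggregated table holds the combined award total for those two people.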

By age group

#create groups

dat$age_group <- ifelse(dat$age <= 30, '20-30',
                        ifelse(dat$age <= 40, '31-40',
                               '41 +'))

sum_by_age_group <- aggregate(awards ~ age_group, data = dat, FUN = sum)

ggplot(sum_by_age_group, aes(x = age_group, y = awards))+
    geom_bar(stat = 'identity')

Note

We could skip the aggregate step altogether and just use:

ggplot(dat, aes(x = age, y = awards)) + geom_bar(stat = 'identity')

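This works because geom_bar(stat = 'identity') stacks one bar segment per row, so the total bar height at each age equals the sum of awards. I prefer not to do it that way, though, because an intermediate data step may be useful within your analytical pipeline for comparisons other than visualizing.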
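
To illustrate why an intermediate table can be worth keeping, here is a small sketch that derives a second quantity (mean awards per person) next to the total. It uses the sample data from the question; the names `small`, `totals`, `people` and `summary_tbl` are assumptions made for this example only.

```r
# Sample data from the question (all object names here are illustrative)
small <- data.frame(
  age    = c(21, 22, 22, 25, 30, 34, 45, 26, 37, 46, 49, 21),
  awards = c(0, 3, 2, 1, 0, 0, 1, 3, 1, 1, 1, 1)
)
small$age_group <- ifelse(small$age <= 30, "20-30",
                          ifelse(small$age <= 40, "31-40", "41-50"))

# Total awards and number of people per group
totals <- aggregate(awards ~ age_group, data = small, FUN = sum)
people <- aggregate(awards ~ age_group, data = small, FUN = length)

# Both results are ordered by age_group, so the columns line up
summary_tbl <- data.frame(
  age_group    = totals$age_group,
  total_awards = totals$awards,
  n_people     = people$awards,
  mean_awards  = totals$awards / people$awards
)
summary_tbl
```

The same table can feed the bar plot and also answer questions the plot alone cannot, such as whether a group's high total comes from more people or from more awards per person.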




Answer 2:


For completeness, I am adding the base R solution to @bouncyball's great answer. I will use their synthetic data, but I will use cut to create the age groups before aggregation.

# Creates data for plotting
> set.seed(123)
> dat <- data.frame(age = sample(20:50, 200, replace = TRUE),
                    awards = rpois(200, 3))

# Create a new column containing the age groups
> dat[["ageGroups"]] <- cut(dat[["age"]], c(-Inf, 20, 30, 40, Inf),
                            right = FALSE)

cut divides a numeric vector into groups based on the breaks given in its second argument. right = FALSE makes each interval closed on the left and open on the right, so a group includes its lower bound rather than its upper one (i.e. 20 <= x < 30 rather than the default of 20 < x <= 30). The groups do not have to be equally spaced. If you do not want to include data above or below a certain value, remove the Inf from the end or the -Inf from the beginning respectively, and cut will return <NA> for those values instead. To give the groups names, use the labels argument.
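
As a sketch of the labels argument, the following applies cut to the ages from the question. Note that it deliberately differs from the call above: it keeps the default right = TRUE so that age 30 lands in the "20-30" group the asker described, and it adds include.lowest = TRUE so that an age of exactly 20 would not become NA. The names `ages` and `groups` are illustrative.

```r
# Ages from the question (object names are illustrative)
ages <- c(21, 22, 22, 25, 30, 34, 45, 26, 37, 46, 49, 21)

# Intervals are [20,30], (30,40], (40,50] with the default right = TRUE
groups <- cut(ages,
              breaks = c(20, 30, 40, 50),
              labels = c("20-30", "31-40", "41-50"),
              include.lowest = TRUE)  # keep the lowest break, 20, in the first group

# Number of people per labelled group
table(groups)
```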

Now we can aggregate based on the groups we created.

> (summedGroups <- aggregate(awards ~ ageGroups, dat, FUN = sum))
  ageGroups awards
1   [20,30)    188
2   [30,40)    212
3 [40, Inf)    194

Finally, we can plot these data using the barplot function. The key here is to pass the age groups as the bar labels via the names.arg argument.

> barplot(summedGroups[["awards"]], names.arg = summedGroups[["ageGroups"]])
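
As a possible alternative, base R's xtabs can do the grouping and summing in one step: when a variable appears on the left-hand side of its formula, xtabs sums that variable within each group instead of counting rows. The sketch below uses the sample data from the question; the names `small` and `totals` and the axis labels are assumptions for illustration.

```r
# Sample data from the question (object names are illustrative)
small <- data.frame(
  age    = c(21, 22, 22, 25, 30, 34, 45, 26, 37, 46, 49, 21),
  awards = c(0, 3, 2, 1, 0, 0, 1, 3, 1, 1, 1, 1)
)
small$ageGroups <- cut(small$age, c(20, 30, 40, 50), include.lowest = TRUE)

# awards on the left-hand side: summed within each age group, not counted
totals <- xtabs(awards ~ ageGroups, data = small)
totals

# A one-dimensional table carries its group names, so barplot labels the bars
barplot(totals, xlab = "Age group", ylab = "Total awards")
```

This skips the separate aggregate step, at the cost of returning a table rather than a data frame.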



Source: https://stackoverflow.com/questions/41409573/plotting-binned-data-using-sum-instead-of-count
