Relative frequencies / proportions with dplyr

问题

Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative frequency of number of gears by am (automatic/manual) in one go with dplyr?

library(dplyr)
data(mtcars)
mtcars <- tbl_df(mtcars)

# count frequency
mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n())

# am gear  n
#  0    3 15 
#  0    4  4 
#  1    4  8  
#  1    5  5

What I would like to achieve:

am gear  n rel.freq
 0    3 15      0.7894737
 0    4  4      0.2105263
 1    4  8      0.6153846
 1    5  5      0.3846154

回答1:

Try this:

mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))

#   am gear  n      freq
# 1  0    3 15 0.7894737
# 2  0    4  4 0.2105263
# 3  1    4  8 0.6153846
# 4  1    5  5 0.3846154

From the dplyr vignette:

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.

Thus, after the summarise, the last of the grouping variables 'gear' is peeled off, and the data is then grouped 'only' by 'am' (just check it with groups on the resulting data), on which we then perform the mutate calculation.

The outcome of the peeling is of course dependent of the order of the grouping variables in the group_by call. You may wish to do a subsequent group_by(am), to make your code more explicit.

For rounding and prettification, please refer to the nice answer by @Tyler Rinker.

回答2:

You can use count() function, which has however a different behaviour depending on the version of dplyr:

dplyr 0.7.1: returns an ungrouped table: you need to group again by am
dplyr < 0.7.1: returns a grouped table, so no need to group again, although you might want to ungroup() for later manipulations

dplyr 0.7.1

mtcars %>%
  count(am, gear) %>%
  group_by(am) %>%
  mutate(freq = n / sum(n))

dplyr < 0.7.1

mtcars %>%
  count(am, gear) %>%
  mutate(freq = n / sum(n))

This results into a grouped table, if you want to use it for further analysis, it might be useful to remove the grouped attribute with ungroup().

回答3:

@Henrik's is better for usability as this will make the column character and no longer numeric but matches what you asked for...

mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))

##   am gear  n rel.freq
## 1  0    3 15      79%
## 2  0    4  4      21%
## 3  1    4  8      62%
## 4  1    5  5      38%

EDIT Because Spacedman asked for it :-)

as.rel_freq <- function(x, rel_freq_col = "rel.freq", ...) {
    class(x) <- c("rel_freq", class(x))
    attributes(x)[["rel_freq_col"]] <- rel_freq_col
    x
}

print.rel_freq <- function(x, ...) {
    freq_col <- attributes(x)[["rel_freq_col"]]
    x[[freq_col]] <- paste0(round(100 * x[[freq_col]], 0), "%")   
    class(x) <- class(x)[!class(x)%in% "rel_freq"]
    print(x)
}

mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = n/sum(n)) %>%
  as.rel_freq()

## Source: local data frame [4 x 4]
## Groups: am
## 
##   am gear  n rel.freq
## 1  0    3 15      79%
## 2  0    4  4      21%
## 3  1    4  8      62%
## 4  1    5  5      38%

回答4:

Here is a general function implementing Henrik's solution on dplyr 0.7.1.

freq_table <- function(x, 
                       group_var, 
                       prop_var) {
  group_var <- enquo(group_var)
  prop_var  <- enquo(prop_var)
  x %>% 
    group_by(!!group_var, !!prop_var) %>% 
    summarise(n = n()) %>% 
    mutate(freq = n /sum(n)) %>% 
    ungroup
}

回答5:

I wrote a small function for this repeating task:

count_pct <- function(df) {
  return(
    df %>%
      tally %>% 
      mutate(n_pct = 100*n/sum(n))
  )
}

I can then use it like:

mtcars %>% 
  group_by(cyl) %>% 
  count_pct

It returns:

# A tibble: 3 x 3
    cyl     n n_pct
  <dbl> <int> <dbl>
1     4    11  34.4
2     6     7  21.9
3     8    14  43.8

回答6:

This answer is based upon Matifou's answer.

First I modified it to ensure that I don't get the freq column returned as a scientific notation column by using the scipen option.

Then I multiple the answer by 100 to get a percent rather than decimal to make the freq column easier to read as a percentage.

getOption("scipen") 
options("scipen"=10) 
mtcars %>%
count(am, gear) %>% 
mutate(freq = (n / sum(n)) * 100)

回答7:

Despite the many answers, one more approach which uses prop.table in combination with dplyr or data.table.

library("dplyr")
mtcars %>%
    group_by(am, gear) %>%
    summarise(n = n()) %>%
    mutate(freq = prop.table(n))

library("data.table")
cars_dt <- as.data.table(mtcars)
cars_dt[, .(n = .N), keyby = .(am, gear)][, freq := prop.table(n) , by = "am"]

来源：https://stackoverflow.com/questions/24576515/relative-frequencies-proportions-with-dplyr

标签

group-by

dplyr

frequency