Relative frequencies / proportions with dplyr

徘徊边缘 提交于 2019-11-26 11:02:35

Try this:

mtcars %>%
  group_by(am, gear) %>%
  summarise (n = n()) %>%
  mutate(freq = n / sum(n))

#   am gear  n      freq
# 1  0    3 15 0.7894737
# 2  0    4  4 0.2105263
# 3  1    4  8 0.6153846
# 4  1    5  5 0.3846154

From the dplyr vignette:

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.

Thus, after the summarise, the grouping variable 'gear' is peeled off, and the data is then grouped 'only' by 'am' (just check it with groups on the resulting data), on which we then perform the mutate calculation.

The outcome of the 'peeling' is of course dependent of the order of the grouping variables in the group_by call. We were lucky this time, that it peeled off the desired variable. You may wish to do a subsequent group_by(am), to make your code more explicit.

For rounding and prettification, please refer to the nice answer by @Tyler Rinker.

You can use count() function, which has however a different behaviour depending on the version of dplyr:

  • dplyr 0.7.1: returns an ungrouped table: you need to group again by am

  • dplyr < 0.7.1: returns a grouped table, so no need to group again, although you might want to ungroup() for later manipulations

dplyr 0.7.1

mtcars %>%
  count(am, gear) %>%
  group_by(am) %>%
  mutate(freq = n / sum(n))

dplyr < 0.7.1

mtcars %>%
  count(am, gear) %>%
  mutate(freq = n / sum(n))

This results into a grouped table, if you want to use it for further analysis, it might be useful to remove the grouped attribute with ungroup().

@Henrik's is better for usability as this will make the column character and no longer numeric but matches what you asked for...

mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))

##   am gear  n rel.freq
## 1  0    3 15      79%
## 2  0    4  4      21%
## 3  1    4  8      62%
## 4  1    5  5      38%

EDIT Because Spacedman asked for it :-)

as.rel_freq <- function(x, rel_freq_col = "rel.freq", ...) {
    class(x) <- c("rel_freq", class(x))
    attributes(x)[["rel_freq_col"]] <- rel_freq_col
    x
}

print.rel_freq <- function(x, ...) {
    freq_col <- attributes(x)[["rel_freq_col"]]
    x[[freq_col]] <- paste0(round(100 * x[[freq_col]], 0), "%")   
    class(x) <- class(x)[!class(x)%in% "rel_freq"]
    print(x)
}

mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = n/sum(n)) %>%
  as.rel_freq()

## Source: local data frame [4 x 4]
## Groups: am
## 
##   am gear  n rel.freq
## 1  0    3 15      79%
## 2  0    4  4      21%
## 3  1    4  8      62%
## 4  1    5  5      38%

Here is a general function implementing Henrik's solution on dplyr 0.7.1.

freq_table <- function(x, 
                       group_var, 
                       prop_var) {
  group_var <- enquo(group_var)
  prop_var  <- enquo(prop_var)
  x %>% 
    group_by(!!group_var, !!prop_var) %>% 
    summarise(n = n()) %>% 
    mutate(freq = n /sum(n)) %>% 
    ungroup
}

I wrote a small function for this repeating task:

count_pct <- function(df) {
  return(
    df %>%
      tally %>% 
      mutate(n_pct = 100*n/sum(n))
  )
}

I can then use it like:

mtcars %>% 
  group_by(cyl) %>% 
  count_pct

It returns:

# A tibble: 3 x 3
    cyl     n n_pct
  <dbl> <int> <dbl>
1     4    11  34.4
2     6     7  21.9
3     8    14  43.8

This answer is based upon Matifou's answer.

First I modified it to ensure that I don't get the freq column returned as a scientific notation column by using the scipen option.

Then I multiple the answer by 100 to get a percent rather than decimal to make the freq column easier to read as a percentage.

getOption("scipen") 
options("scipen"=10) 
mtcars %>%
count(am, gear) %>% 
mutate(freq = (n / sum(n)) * 100)

Despite the many answers, one more approach which uses prop.table in combination with dplyr or data.table.

library("dplyr")
mtcars %>%
    group_by(am, gear) %>%
    summarise(n = n()) %>%
    mutate(freq = prop.table(n))

library("data.table")
cars_dt <- as.data.table(mtcars)
cars_dt[, .(n = .N), keyby = .(am, gear)][, freq := prop.table(n) , by = "am"]
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!