Relative frequencies / proportions with dplyr

灰色年华 2020-11-22 09:25

Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative f

9 answers
  •  小蘑菇 (original poster)
    2020-11-22 09:53

    For the sake of completeness on this popular question: since dplyr 1.0.0, the .groups argument controls the grouping structure of the output of summarise after group_by (see the summarise help page).

    With .groups = "drop_last", summarise drops the last level of grouping. This was the only behavior available before version 1.0.0.

    library(dplyr)
    library(scales)
    
    original <- mtcars %>%
      group_by (am, gear) %>%
      summarise (n=n()) %>%
      mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))
    #> `summarise()` regrouping output by 'am' (override with `.groups` argument)
    
    original
    #> # A tibble: 4 x 4
    #> # Groups:   am [2]
    #>      am  gear     n rel.freq
    #>   <dbl> <dbl> <int> <chr>   
    #> 1     0     3    15 78.9%   
    #> 2     0     4     4 21.1%   
    #> 3     1     4     8 61.5%   
    #> 4     1     5     5 38.5%
    
    new_drop_last <- mtcars %>%
      group_by (am, gear) %>%
      summarise (n=n(), .groups = "drop_last") %>%
      mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))
    
    dplyr::all_equal(original, new_drop_last)
    #> [1] TRUE
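
    To see why the proportions above sum to 100% within each am value, it can help to inspect which grouping variables survive summarise. The following is a minimal sketch; the names by_am_gear, check, prop and total are illustrative only and not part of the original answer.

    ```r
    library(dplyr)

    # Count cars per (am, gear); "drop_last" peels off gear and keeps am.
    by_am_gear <- mtcars %>%
      group_by(am, gear) %>%
      summarise(n = n(), .groups = "drop_last")

    # The remaining grouping variable determines the scope of sum(n) in mutate().
    group_vars(by_am_gear)
    #> [1] "am"

    # Sanity check: proportions add up to 1 inside each am group.
    check <- by_am_gear %>%
      mutate(prop = n / sum(n)) %>%
      summarise(total = sum(prop))
    ```

    Because the result is still grouped by am, any later mutate computes sum(n) separately for am = 0 and am = 1.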
    

    With .groups = "drop", all levels of grouping are dropped. The result is an independent tibble with no trace of the previous group_by, so the proportions are computed over the whole table (all 32 cars) rather than within each am value.

    # .groups = "drop"
    new_drop <- mtcars %>%
      group_by (am, gear) %>%
      summarise (n=n(), .groups = "drop") %>%
      mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))
    
    new_drop
    #> # A tibble: 4 x 4
    #>      am  gear     n rel.freq
    #>   <dbl> <dbl> <int> <chr>   
    #> 1     0     3    15 46.9%   
    #> 2     0     4     4 12.5%   
    #> 3     1     4     8 25.0%   
    #> 4     1     5     5 15.6%
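
    As a point of comparison, the same overall proportions can be sketched with count, which returns an ungrouped result when its input is ungrouped. This is an alternative shown under that assumption, not the approach used in the answer; the names overall and prop are illustrative.

    ```r
    library(dplyr)

    # count() on an ungrouped data frame yields an ungrouped result,
    # so sum(n) spans all rows of mtcars (32 cars).
    overall <- mtcars %>%
      count(am, gear) %>%
      mutate(prop = n / sum(n))

    # First row is am = 0, gear = 3: 15 of 32 cars.
    overall$prop[1]
    ```

    Keeping prop numeric until the final step, and formatting with scales::percent only for display, leaves the column usable for further arithmetic.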
    

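    With .groups = "keep", the result has the same grouping structure as .data (here, mtcars grouped by am and gear): summarise does not peel off any variable used in group_by. Each (am, gear) combination is therefore its own one-row group, and every relative frequency comes out as 100%.

    Finally, with .groups = "rowwise", each row is its own group. It is equivalent to "keep" in this situation.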

    # .groups = "keep"
    new_keep <- mtcars %>%
      group_by (am, gear) %>%
      summarise (n=n(), .groups = "keep") %>%
      mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))
    
    new_keep
    #> # A tibble: 4 x 4
    #> # Groups:   am, gear [4]
    #>      am  gear     n rel.freq
    #>   <dbl> <dbl> <int> <chr>   
    #> 1     0     3    15 100.0%  
    #> 2     0     4     4 100.0%  
    #> 3     1     4     8 100.0%  
    #> 4     1     5     5 100.0%
    
    # .groups = "rowwise"
    new_rowwise <- mtcars %>%
      group_by (am, gear) %>%
      summarise (n=n(), .groups = "rowwise") %>%
      mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))
    
    dplyr::all_equal(new_keep, new_rowwise)
    #> [1] TRUE
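
    The two results hold the same values but are different kinds of object. A minimal sketch of how to check this, keeping the proportion numeric (the names keep_num, rowwise_num and prop are illustrative):

    ```r
    library(dplyr)

    keep_num <- mtcars %>%
      group_by(am, gear) %>%
      summarise(n = n(), .groups = "keep") %>%
      mutate(prop = n / sum(n))

    rowwise_num <- mtcars %>%
      group_by(am, gear) %>%
      summarise(n = n(), .groups = "rowwise") %>%
      mutate(prop = n / sum(n))

    # "keep" yields a grouped data frame, "rowwise" a row-wise one.
    inherits(keep_num, "grouped_df")
    inherits(rowwise_num, "rowwise_df")

    # In both cases every group holds a single row, so n / sum(n) is 1.
    all(keep_num$prop == 1)
    all(rowwise_num$prop == 1)
    ```

    Neither option is useful for within-am proportions; they are shown here only to complete the list of .groups values.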
    

    Another point of interest: after applying group_by and summarise, a subtotal line for each group can sometimes help readability.

    # create a subtotal line to help readability
    subtotal_am <- mtcars %>%
      group_by (am) %>% 
      summarise (n=n()) %>%
      mutate(gear = NA, rel.freq = 1)
    #> `summarise()` ungrouping output (override with `.groups` argument)
    
    mtcars %>% group_by (am, gear) %>%
      summarise (n=n()) %>% 
      mutate(rel.freq = n/sum(n)) %>%
      bind_rows(subtotal_am) %>%
      arrange(am, gear) %>%
      mutate(rel.freq =  scales::percent(rel.freq, accuracy = 0.1))
    #> `summarise()` regrouping output by 'am' (override with `.groups` argument)
    #> # A tibble: 6 x 4
    #> # Groups:   am [2]
    #>      am  gear     n rel.freq
    #>   <dbl> <dbl> <int> <chr>   
    #> 1     0     3    15 78.9%   
    #> 2     0     4     4 21.1%   
    #> 3     0    NA    19 100.0%  
    #> 4     1     4     8 61.5%   
    #> 5     1     5     5 38.5%   
    #> 6     1    NA    13 100.0%
    

    Created on 2020-11-09 by the reprex package (v0.3.0)

    Hope you find this answer useful.
