dplyr - filter by group size

渐次进展 2020-11-28 15:12

What is the best way to filter a data.frame so that only groups of a given size, say 5, remain?

So my data looks as follows:

require(dplyr)
n <- 1e5
x <- rnorm(n)

6 answers
  •  心在旅途
    2020-11-28 15:54

    I generalised the function written by docendo discimus so that it can be used alongside existing dplyr functions:

    #' @inherit dplyr::filter
    #' @param min minimal group size, use \code{min = NULL} to filter on maximal group size only
    #' @param max maximal group size, use \code{max = NULL} to filter on minimal group size only
    #' @export
    #' @source Stack Overflow answer by docendo discimus, \url{https://stackoverflow.com/a/43110620/4575331}
    filter_group_size <- function(.data, min = NULL, max = min) {
      g <- dplyr::group_size(.data)  # one size per group, in group order
      if (is.null(min) && is.null(max)) {
        stop('`min` and `max` cannot both be NULL.')
      }
      if (is.null(min)) {
        min <- 0  # no lower bound
      }
      if (is.null(max)) {
        max <- base::max(g, na.rm = TRUE)  # no upper bound
      }
      # Look up each row's group, so rows do not have to be sorted by group
      ind <- (g >= min & g <= max)[dplyr::group_indices(.data)]
      .data[ind, ]
    }
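
    To make the mechanics easier to follow, here is a small sketch on made-up data. `group_size()` returns one size per group, so the per-group decision has to be expanded to one value per row. The data frame `toy` and the variable names are hypothetical, and the lookup through `group_indices()` is just one way to do the expansion that does not depend on the row order:

    ```r
    library(dplyr)

    # Hypothetical toy data: "a" has 1 row, "b" has 2 rows, "c" has 3 rows,
    # deliberately not sorted by group
    toy <- data.frame(cat = c("c", "a", "b", "c", "b", "c"), x = 1:6)
    grouped <- group_by(toy, cat)

    g <- group_size(grouped)       # size of each group, in group order: 1 2 3
    idx <- group_indices(grouped)  # group number of each row: 3 1 2 3 2 3
    keep <- g >= 2 & g <= 3        # per-group decision: FALSE TRUE TRUE

    # Expand the per-group decision to rows and subset
    result <- grouped[keep[idx], ]
    ```

    Here `result` should hold the five rows belonging to "b" and "c"; the single row of "a" is dropped.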
    

    Let's check it for a minimal group size of 5:

    dat2 %>%
      group_by(cat) %>%
      filter_group_size(5, NULL) %>%
      summarise(n = n()) %>%
      arrange(desc(n))
    
    # # A tibble: 6,634 x 2
    #      cat     n
    #     
    #  1    NA    19
    #  2     1     5
    #  3     2     5
    #  4     6     5
    #  5    15     5
    #  6    17     5
    #  7    21     5
    #  8    27     5
    #  9    33     5
    # 10    37     5
    # # ... with 6,624 more rows
    

    Great. Now check the OP's actual question, a group size of exactly 5:

    dat2 %>%
      group_by(cat) %>%
      filter_group_size(5) %>%
      summarise(n = n()) %>%
      pull(n) %>%
      unique()
    # [1] 5
    

    Hooray.
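
    As a cross-check, the same kind of filtering can probably also be sketched with plain dplyr verbs, since `n()` gives the size of the current group inside `filter()`. The data frame `toy` below is made up for illustration and is not the data from the question:

    ```r
    library(dplyr)

    # Hypothetical toy data: group sizes are 1 ("a"), 2 ("b") and 3 ("c")
    toy <- data.frame(cat = c("a", "b", "b", "c", "c", "c"), x = 1:6)

    # n() is the size of the current group, so this keeps groups of exactly 3 rows
    exact <- toy %>%
      group_by(cat) %>%
      filter(n() == 3)

    # A size range works the same way: keep groups with 2 to 3 rows
    ranged <- toy %>%
      group_by(cat) %>%
      filter(n() >= 2, n() <= 3)
    ```

    On this toy data `exact` should contain only the three "c" rows, and `ranged` the rows of "b" and "c". A dedicated helper mainly pays off when the `min`/`max` logic is reused in several places.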
