R dplyr summarise multiple functions to selected variables

死守一世寂寞 2021-01-15 11:51

I have a dataset that I want to summarise by mean, but I also want to calculate the max for just one of the variables.

Let me start with an example of what I would like to a

4 answers
  •  [愿得一人]
    2021-01-15 12:17

    Although this is an old question, it remains an interesting problem for which I have two solutions that I believe should be available to whoever finds this page.

    Solution one

    My own take:

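    library(dplyr)
    library(purrr)

    # Pair each variable set with its function, summarise, then merge on Species
    mapply(summarise_at,
           .vars = lst(names(iris)[!names(iris) %in% "Species"], "Petal.Width"),
           .funs = lst(mean, max),
           MoreArgs = list(.tbl = iris %>% group_by(Species) %>% filter(Sepal.Length > 5)),
           SIMPLIFY = FALSE) %>%
      reduce(merge, by = "Species")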
    
        #         Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
        #    1     setosa        5.314       3.714        1.509        0.2773           0.5
        #    2 versicolor        5.998       2.804        4.317        1.3468           1.8
        #    3  virginica        6.622       2.984        5.573        2.0327           2.5
    

    Solution two

    An elegant solution using package purrr from the tidyverse itself, inspired by this discussion:

    # Same pairing with purrr: pmap() walks .vars and .funs in parallel
    list(.vars = lst(names(iris)[!names(iris) %in% "Species"], "Petal.Width"),
         .funs = lst("mean" = mean, "max" = max)) %>%
      pmap(~ iris %>% group_by(Species) %>% filter(Sepal.Length > 5) %>% summarise_at(.x, .y)) %>%
      reduce(inner_join, by = "Species")
    
    # A tibble: 3 x 6
      Species    Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
      <fct>             <dbl>       <dbl>        <dbl>         <dbl>         <dbl>
    1 setosa             5.31        3.71         1.51         0.277           0.5
    2 versicolor         6.00        2.80         4.32         1.35            1.8
    3 virginica          6.62        2.98         5.57         2.03            2.5
    

    Short discussion

    The data.frame (solution one) and the tibble (solution two) are the desired result: the last column is the max of Petal.Width, and the other columns are the group means, after filtering, of every numeric variable (including Petal.Width itself, as Petal.Width.x).
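    As an aside, separate from the two solutions above: on dplyr 1.0.0 or later the same table can probably be produced in a single summarise call with across(). This is a swapped-in technique, not the approach this answer describes, and the column names Petal.Width_max and the "_mean" suffix are my own choices for illustration. The suffix matters: without it the mean would overwrite Petal.Width before max() sees it.

```r
library(dplyr)

# Sketch assuming dplyr >= 1.0.0 (across() and its .names argument).
res <- iris %>%
  filter(Sepal.Length > 5) %>%
  group_by(Species) %>%
  summarise(
    # Means of every numeric column, renamed so the originals stay visible
    across(where(is.numeric), mean, .names = "{.col}_mean"),
    # Max of the one extra variable, computed on the original column
    Petal.Width_max = max(Petal.Width)
  )

print(res)
```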

    Both solutions hinge on three realizations:

    1. summarise_at accepts two arguments, a set of n variables and a list of m functions, and applies all m functions to all n variables, producing m × n summary columns in a tibble. The solution is therefore to make it loop over pairs: one group of variables with the one function we want applied to them, then another group of variables with its own function, and so on.
    2. What in R applies an operation to corresponding elements of two lists? mapply, or map2, pmap and their variants from purrr, dplyr's fellow tidyverse package. Each takes lists of equal length and performs a given operation on corresponding elements (matched by position).
    3. Because the result is a list rather than a single tibble or data.frame, you need to combine its elements with reduce, using inner_join or simply merge.
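    The three points can be seen one at a time in the sketch below. It uses map2 rather than the mapply and pmap calls shown earlier, purely because it is the shortest way to pair two lists. The object names (grouped, crossed, vars, funs, pieces, result) are mine and not part of either solution.

```r
library(dplyr)
library(purrr)

# Filter and group once, then reuse the grouped table in every call.
grouped <- iris %>%
  filter(Sepal.Length > 5) %>%
  group_by(Species)

# Point 1: two variables and two functions give 2 x 2 = 4 summary columns.
crossed <- grouped %>%
  summarise_at(c("Sepal.Length", "Petal.Width"), list(mean = mean, max = max))

# Point 2: map2() pairs the i-th variable set with the i-th function.
vars <- list(setdiff(names(iris), "Species"), "Petal.Width")
funs <- list(mean, max)
pieces <- map2(vars, funs, ~ summarise_at(grouped, .x, .y))

# Point 3: pieces is a list of tibbles, so join them on the grouping key.
result <- reduce(pieces, inner_join, by = "Species")

print(result)
```

    Because Petal.Width appears in both pieces, the join disambiguates the two columns with the .x and .y suffixes seen in the outputs above.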

    Note that the means I obtain differ from those in the OP's question, but I get the same values when I run the OP's own reproducible example (maybe we have two different versions of the iris dataset?).
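    One way to check the means independently of dplyr is a base R cross-check. This is only a sketch: it assumes the filter is meant to be applied to individual rows (Sepal.Length > 5) before averaging, and the name check is hypothetical.

```r
# Base R only: keep rows with Sepal.Length > 5, then average every
# remaining column within each Species.
check <- aggregate(. ~ Species,
                   data = subset(iris, Sepal.Length > 5),
                   FUN = mean)

print(check)
```

    If these values match the tables above, the discrepancy lies in the OP's numbers rather than in the pairing approach.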
