Tidyverse: filtering n largest groups in grouped dataframe

核能气质少年 提交于 2020-01-02 01:10:20

问题


I want to filter the n largest groups based on count, and then do some calculations on the filtered dataframe

Here is some data

Brand <- c("A","B","C","A","A","B","A","A","B","C")
Category <- c(1,2,1,1,2,1,2,1,2,1)
Clicks <- c(10,11,12,13,14,15,14,13,12,11)
df <- data.frame(Brand,Category,Clicks)

|Brand | Category| Clicks|
|:-----|--------:|------:|
|A     |        1|     10|
|B     |        2|     11|
|C     |        1|     12|
|A     |        1|     13|
|A     |        2|     14|
|B     |        1|     15|
|A     |        2|     14|
|A     |        1|     13|
|B     |        2|     12|
|C     |        1|     11|

This is my expected output. I want to filter out the two largest brands by count and then find the mean clicks in each brand / category combination

|Brand | Category| mean_clicks|
|:-----|--------:|-----------:|
|A     |        1|        12.0|
|A     |        2|        14.0|
|B     |        1|        15.0|
|B     |        2|        11.5|

Which I thought could be achieved with code like this (but can't)

df %>%
  group_by(Brand, Category) %>%
  top_n(2, Brand) %>% # Largest 2 brands by count
  summarise(mean_clicks = mean(Clicks))

EDIT: the ideal answer should be able to be used on database tables as well as local tables


回答1:


Another dplyr solution using a join to filter the data frame:

library(dplyr)

df %>%
  group_by(Brand) %>%
  summarise(n = n()) %>%
  top_n(2) %>% # select top 2
  left_join(df, by = "Brand") %>% # filters out top 2 Brands
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# # A tibble: 4 x 3
# # Groups:   Brand [?]
#   Brand Category mean_clicks
#   <fct>    <dbl>       <dbl>
# 1 A            1        12  
# 2 A            2        14  
# 3 B            1        15  
# 4 B            2        11.5



回答2:


A different dplyr solution:

df %>%
  group_by(Brand) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  mutate(rank = dense_rank(desc(n))) %>%
  filter(rank == 1 | rank == 2) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# A tibble: 4 x 3
# Groups:   Brand [?]
  Brand Category mean_clicks
  <fct>    <dbl>       <dbl>
1 A           1.        12.0
2 A           2.        14.0
3 B           1.        15.0
4 B           2.        11.5

Or a simplified version (based on suggestions from @camille):

df %>%
  group_by(Brand) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(dense_rank(desc(n)) < 3) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))



回答3:


EDIT

Based on updated question, we can add a count column first, filter only top n group counts, then group_by Brand and Category to find the mean for each group.

df %>%
  add_count(Brand, sort = TRUE) %>%
  filter(n %in% head(unique(n), 2)) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))


#   Brand Category mean_clicks
#   <fct>    <dbl>       <dbl>
#1 A            1        12  
#2 A            2        14  
#3 B            1        15  
#4 B            2        11.5

Original Answer

We can group_by Brand and do all the calculations by group and then filter top groups by top_n

library(dplyr)
df %>%
  group_by(Brand) %>%
  summarise(n = n(), 
            mean = mean(Clicks)) %>%
  top_n(2, n) %>%
  select(-n)

#  Brand  mean
#  <fct> <dbl>
#1  A      12.8
#2  B      12.7



回答4:


A data.table idea is to get the counts grouped by Brands and filter the top two (after ordering in descending order). Then we merge with the original data frame and find the mean grouped by (Brand, Category)

library(data.table)

#Convert to data.table
dt1 <- setDT(df)

dt1[dt1[, .(cnt = .N), by = Brand][
             order(cnt, decreasing = TRUE), .SD[1:2]][,cnt := NULL], 
                   on = 'Brand'][, .(means = mean(Clicks)), by = .(Brand, Category)][]

which gives,

   Brand Category means
1:     A        1  12.0
2:     A        2  14.0
3:     B        2  11.5
4:     B        1  15.0



回答5:


How about this approach, using table, from base R -

df %>%
  filter(Brand %in% names(tail(sort(table(Brand)), 2))) %>%
  group_by(Brand, Category) %>%
  summarise(mean_clicks = mean(Clicks))

# A tibble: 4 x 3
# Groups:   Brand [?]
  Brand Category mean_clicks
  <chr>    <dbl>       <dbl>
1 A         1.00        12.0
2 A         2.00        14.0
3 B         1.00        15.0
4 B         2.00        11.5



回答6:


Slightly different than above. Just because I don't like to use join with large datasets. Some people might not like that I make and remove a small dataframe, sorry :(

df %>% count(Brand) %>% top_n(2,n) -> Top2
df %>% group_by(Brand, Category) %>% 
filter(Brand %in% Top2$Brand) %>% 
summarise(mean_clicks = mean(Clicks))
remove(Top2)


来源:https://stackoverflow.com/questions/52532080/tidyverse-filtering-n-largest-groups-in-grouped-dataframe

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!