Get group mean with multiple grouping variables and excluding own group value

不问归期 提交于 2020-03-23 02:05:28

问题


I'm looking for a faster way to calculate a group mean with multiple grouping variables while excluding own group values. A thought experiment would be finding average value (e.g. price) for a county from the counties in the same state in the same year excluding own county's value. Here's a toy data set.

df <- data_frame(
  state = rep(c("AL", "CA"), each = 6),
  county = rep(letters[1:6], each = 2),
  year = rep(c(2011:2012), 6),
  value = sample.int(100, 12)
)

df %>%
  group_by(state, county, year) %>%
  summarise(q = mean(df$value[df$state == state & df$county != county & df$year == year]))

# Groups:   state, county [6]
   state county  year     q
   <chr> <chr>  <int> <dbl>
 1 AL    a       2011  56  
 2 AL    a       2012  46  
 3 AL    b       2011  50.5
 4 AL    b       2012  52  
 5 AL    c       2011  55.5
 6 AL    c       2012  29  
 7 CA    d       2011  52.5
 8 CA    d       2012  32  
 9 CA    e       2011  68.5
10 CA    e       2012  31.5
11 CA    f       2011  32  
12 CA    f       2012  42.5

The above code gives me a desired result, but when I apply this to a larger dataset (with more grouping variables) it gets really slow. Do you have any suggestion on how to speed this up?

If the original approach is incorrect, please point that out as well.


回答1:


library(dplyr)

df %>%
  group_by(state, year) %>%
  mutate(q = (sum(value) - value) / (n()-1))

#> # A tibble: 12 x 5
#> # Groups:   state, year [4]
#>    state county  year value     q
#>    <chr> <chr>  <int> <int> <dbl>
#>  1 AL    a       2011    68  30.5
#>  2 AL    a       2012    63  42  
#>  3 AL    b       2011    53  38  
#>  4 AL    b       2012    56  45.5
#>  5 AL    c       2011     8  60.5
#>  6 AL    c       2012    28  59.5
#>  7 CA    d       2011     7  40  
#>  8 CA    d       2012    69  41  
#>  9 CA    e       2011    39  24  
#> 10 CA    e       2012    79  36  
#> 11 CA    f       2011    41  23  
#> 12 CA    f       2012     3  74

Data:

#data_frame is deprecate!
df <- tibble(
  state = rep(c("AL", "CA"), each = 6),
  county = rep(letters[1:6], each = 2),
  year = rep(c(2011:2012), 6),
  value = sample.int(100, 12)
)



回答2:


A more efficient approach would be to sum the 'value's after grouping by 'state', 'year', subtract from the 'value', then divide the n() - 1 observations

library(dplyr)   
library(purrr) 
out2 <- df %>%
          group_split(state, year) %>%
          map_dfr(~ .x %>% 
                 mutate(q = (sum(value) - value)/(n()-1))) %>%
          select(-value) %>%
          arrange(state, county)

-checking with the OP's output ('out1')

all.equal(out2, out1, check.attributes = FALSE)
#[1] TRUE


来源:https://stackoverflow.com/questions/59778868/get-group-mean-with-multiple-grouping-variables-and-excluding-own-group-value

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!