Calculating the mean using logical condition

Question

I have a football dataset for a season, and some of the variables are: player_id, week and points (a grade for each player in a match).

So, each player_id appears several times in my dataset.

My goal is to calculate the average points for each player, but only over previous weeks.

For example, for the row where player_id=5445 and week=10, I want the mean over the rows where player_id=5445 and week is from 1 to 9.

I know I can do this by filtering the data for each row and calculating the mean, but I hope to do it in a smarter/faster way...

I thought of something like:

aggregate(mydata$points, FUN=mean, 
          by=list(player_id=mydata$player_id, week<mydata$week))

but it did not work.

Thanks!

Answer 1:

Here's a solution, along with some sample data:

football_df <- 
  data.frame(player_id = c(1, 2, 3, 4),
             points = as.integer(runif(40, 0, 10)), 
             week = rep(1:10, each = 4))

Getting a running average:

library(dplyr)
football_df %>% 
      group_by(player_id) %>%             # compute the statistic per player
      arrange(week) %>%                   # put the weeks in ascending order
      mutate(avg = cummean(points)) %>%   # cumulative mean up to and including each week
      mutate(avg = lag(avg)) %>%          # shift by one row so each week only sees earlier weeks
      arrange(player_id, week)            # sort by player, then by week

Here are the first two players of the resulting table. For player 1 in week 2, the average over the previous weeks is 7, and in week 3 it is (7 + 9) / 2 = 8, and so on. Week 1 is NA because there are no earlier weeks to average:

   player_id points week      avg
1          1      7    1       NA
2          1      9    2 7.000000
3          1      9    3 8.000000
4          1      1    4 8.333333
5          1      4    5 6.500000
6          1      8    6 6.000000
7          1      0    7 6.333333
8          1      2    8 5.428571
9          1      5    9 5.000000
10         1      8   10 5.000000
11         2      6    1       NA
12         2      9    2 6.000000
13         2      5    3 7.500000
14         2      1    4 6.666667
15         2      0    5 5.250000
16         2      9    6 4.200000
17         2      8    7 5.000000
18         2      6    8 5.428571
19         2      6    9 5.500000
20         2      8   10 5.555556
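
As a cross-check without dplyr, the same "mean of earlier weeks" can be sketched in base R with ave() and cumsum(). This is only a sketch: the toy data frame scores and the helper prev_mean are made up for illustration (the points are the first four weeks of the two players shown above), and it assumes exactly one row per player per week.

```r
# Hypothetical toy data: first four weeks of players 1 and 2 from the table above
scores <- data.frame(
  player_id = rep(c(1, 2), each = 4),
  week      = rep(1:4, times = 2),
  points    = c(7, 9, 9, 1, 6, 9, 5, 1)
)

# Sort so that the weeks are in ascending order within each player
scores <- scores[order(scores$player_id, scores$week), ]

# Hypothetical helper: mean of all earlier values in x,
# i.e. (cumulative sum - current value) / (number of earlier rows)
prev_mean <- function(x) {
  n_prev <- seq_along(x) - 1
  ifelse(n_prev == 0, NA_real_, (cumsum(x) - x) / n_prev)
}

# Apply the helper separately to each player's points
scores$avg <- ave(scores$points, scores$player_id, FUN = prev_mean)
print(scores)
```

The avg column should agree with the first four rows of each player in the dplyr output, with NA in week 1.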

Answer 2:

I will use your data, but with a call to set.seed to make the results reproducible. Then I will call aggregate with the formula interface. Note that the cutoff is stored in a variable named last_week rather than week, so that it can be used in subset without clashing with the week column.

set.seed(2550)    # make the results reproducible

player_id <- c(3242,56546,76575,4234,654654,6564,43242,42344,4342,6776,5432,8796,54767)
week <- 1:30
points <- rnorm(390)
mydata <- data.frame(player_id = rep(player_id, 30), 
                     week = rep(week, 13), points)

last_week <- 10
agg <- aggregate(points ~ player_id + week, data = subset(mydata, week < last_week), mean)
head(agg)
#  player_id week     points
#1      3242    1 -1.3281831
#2      4234    1  0.3578657
#3      4342    1 -0.8267423
#4      5432    1 -0.4245487
#5      6564    1 -0.2968879
#6      6776    1  0.8348178
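
With one row per player per week, as in this sample data, grouping by both player_id and week leaves a single row in each group, so the reported mean is just that row's points value. If the aim is one average per player over all weeks before the cutoff, a possible variant drops week from the formula. The sketch below uses a small made-up data frame (toy, with hypothetical player ids 101 and 202) rather than the data above, so the result is easy to check by hand.

```r
# Hypothetical toy data, not from the original post
toy <- data.frame(
  player_id = rep(c(101, 202), each = 4),
  week      = rep(1:4, times = 2),
  points    = c(7, 9, 9, 1, 6, 9, 5, 1)
)

last_week <- 4  # cutoff: only weeks 1 to 3 are averaged

# One mean per player over all weeks before the cutoff
agg_player <- aggregate(points ~ player_id,
                        data = subset(toy, week < last_week),
                        FUN = mean)
print(agg_player)
```

Here player 101 gets (7 + 9 + 9) / 3 and player 202 gets (6 + 9 + 5) / 3. This gives the answer for a single cutoff week only; a value for every row still needs a running calculation.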

Source: https://stackoverflow.com/questions/47084315/calculating-the-mean-using-logical-condition

Tags

mean