Question
I have a football dataset for a season, and some of the variables are: player_id, week and points (a grade for each player in a match).
So, each player_id appears several times in my dataset.
My goal is to calculate the average points for each player, but only over the previous weeks.
For example, for the row where player_id=5445 and week=10, I want the mean over the rows where player_id=5445 and week is from 1 to 9.
I know I can do it by filtering the data for each row and computing the mean, but I'm hoping for a smarter/faster way.
I thought of something like:
aggregate(mydata$points, FUN=mean,
by=list(player_id=mydata$player_id, week<mydata$week))
but it did not work.
Thanks!
Answer 1:
Here's a solution along with some sample data:
football_df <-
  data.frame(player_id = c(1, 2, 3, 4),
             points = as.integer(runif(40, 0, 10)),
             week = rep(1:10, each = 4))
Getting a running average:
library(dplyr)
football_df %>%
  group_by(player_id) %>%             # the group to perform the stat on
  arrange(week) %>%                   # sort by week, so weeks are in order within each player
  mutate(avg = cummean(points)) %>%   # for each week get the cumulative mean
  mutate(avg = lag(avg)) %>%          # shift forward one row, so each week sees only earlier weeks
  arrange(player_id, week)            # sort by player_id, then week
Here are the first two players of the resulting table. For player 1 in week 2, the average over the previous weeks is 7, and in week 3 it is (7 + 9) / 2 = 8, and so on:
   player_id points week      avg
1          1      7    1       NA
2          1      9    2 7.000000
3          1      9    3 8.000000
4          1      1    4 8.333333
5          1      4    5 6.500000
6          1      8    6 6.000000
7          1      0    7 6.333333
8          1      2    8 5.428571
9          1      5    9 5.000000
10         1      8   10 5.000000
11         2      6    1       NA
12         2      9    2 6.000000
13         2      5    3 7.500000
14         2      1    4 6.666667
15         2      0    5 5.250000
16         2      9    6 4.200000
17         2      8    7 5.000000
18         2      6    8 5.428571
19         2      6    9 5.500000
20         2      8   10 5.555556
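For readers who prefer not to depend on dplyr, the same lagged cumulative mean can be sketched in base R. This is only an illustration under some assumptions: the data frame df and the helper prev_mean are hypothetical names, and the points are the values for players 1 and 2 copied from the table above. The last lines compare the result against the row-by-row filtering mentioned in the question.

```r
# Points for players 1 and 2, copied from the table above
df <- data.frame(
  player_id = rep(1:2, each = 10),
  week      = rep(1:10, times = 2),
  points    = c(7, 9, 9, 1, 4, 8, 0, 2, 5, 8,
                6, 9, 5, 1, 0, 9, 8, 6, 6, 8)
)

# Sort so that weeks are in increasing order within each player
df <- df[order(df$player_id, df$week), ]

# Hypothetical helper: mean of all earlier values in x,
# i.e. (cumulative sum minus current value) / number of earlier values
prev_mean <- function(x) {
  n_prev <- seq_along(x) - 1
  ifelse(n_prev == 0, NA_real_, (cumsum(x) - x) / n_prev)
}

# ave() applies prev_mean separately within each player_id
df$avg <- ave(df$points, df$player_id, FUN = prev_mean)

# Brute-force check: filter the earlier weeks row by row
brute <- sapply(seq_len(nrow(df)), function(i) {
  prev <- df$points[df$player_id == df$player_id[i] & df$week < df$week[i]]
  if (length(prev) == 0) NA_real_ else mean(prev)
})
stopifnot(isTRUE(all.equal(df$avg, brute)))
```

The cumulative version touches each row once, whereas the brute-force check rescans the whole data frame for every row, which is the cost the question was hoping to avoid.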
Answer 2:
I will use your data but with a call to set.seed to make the results reproducible. Then I will call aggregate with the formula interface. Note that I've introduced a variable last_week, holding the cutoff week, to be used in subset.
set.seed(2550) # make the results reproducible
player_id <- c(3242,56546,76575,4234,654654,6564,43242,42344,4342,6776,5432,8796,54767)
week <- 1:30
points <- rnorm(390)
mydata <- data.frame(player_id = rep(player_id, 30),
week = rep(week,13),points)
last_week <- 10
agg <- aggregate(points ~ player_id + week, data = subset(mydata, week < last_week), mean)
head(agg)
# player_id week points
#1 3242 1 -1.3281831
#2 4234 1 0.3578657
#3 4342 1 -0.8267423
#4 5432 1 -0.4245487
#5 6564 1 -0.2968879
#6 6776 1 0.8348178
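Note that the call above groups by both player_id and week, so when each player has one row per week, every "mean" is just that single value. If the aim is one mean per player over the weeks before the cutoff, a possible variation is to drop week from the formula. The sketch below uses hypothetical toy data (toy, cutoff and agg_player are illustrative names, not part of the original answer):

```r
# Hypothetical toy data: 2 players, 4 weeks, one row per player per week
toy <- data.frame(player_id = rep(c(101, 102), each = 4),
                  week      = rep(1:4, times = 2),
                  points    = c(2, 4, 6, 8, 1, 3, 5, 7))
cutoff <- 4  # plays the role of last_week above

# One mean per player over the weeks strictly before the cutoff
agg_player <- aggregate(points ~ player_id,
                        data = subset(toy, week < cutoff),
                        FUN = mean)
agg_player
```

This yields the previous-weeks mean for a single target week only. Getting it for every row at once still calls for a cumulative approach like the one in Answer 1.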
Source: https://stackoverflow.com/questions/47084315/calculating-the-mean-using-logical-condition