Question
I have data on all the NCAA basketball games that have occurred since 2003. I am trying to implement a for loop that will calculate the average of a number of stats for each team at a point in time. Here is my for loop:
library(data.table)
roll_season_team_stats <- NULL
for (i in 0:max(stats_DT$DayNum)) {
  stats <- stats_DT[DayNum < i]
  roll_stats <- dcast(stats_DT, TeamID+Season~.,fun=mean,na.rm=T,value.var = c('FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR', 'DR', 'TO'))
  roll_stats$DayNum <- i + 1
  roll_season_team_stats <- rbind(roll_season_team_stats, roll_stats)
}
Here is the output from dput:
structure(list(Season = c(2003L, 2003L, 2003L, 2003L, 2003L,
2003L, 2003L, 2003L, 2003L, 2003L), DayNum = c(10L, 10L, 11L,
11L, 11L, 11L, 12L, 12L, 12L, 12L), TeamID = c(1104L, 1272L,
1266L, 1296L, 1400L, 1458L, 1161L, 1186L, 1194L, 1458L), FGM = c(27L,
26L, 24L, 18L, 30L, 26L, 23L, 28L, 28L, 32L), FGA = c(58L, 62L,
58L, 38L, 61L, 57L, 55L, 62L, 58L, 67L), FGM3 = c(3L, 8L, 8L,
3L, 6L, 6L, 2L, 4L, 5L, 5L), FGA3 = c(14L, 20L, 18L, 9L, 14L,
12L, 8L, 14L, 11L, 17L), FTM = c(11L, 10L, 17L, 17L, 11L, 23L,
32L, 15L, 10L, 15L), FTA = c(18L, 19L, 29L, 31L, 13L, 27L, 39L,
21L, 18L, 19L), OR = c(14L, 15L, 17L, 6L, 17L, 12L, 13L, 13L,
9L, 14L), DR = c(24L, 28L, 26L, 19L, 22L, 24L, 18L, 35L, 22L,
22L), TO = c(23L, 13L, 10L, 12L, 14L, 9L, 17L, 19L, 17L, 6L)), row.names = c(NA,
-10L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x102004ae0>)
The loop runs successfully but it is not producing the correct output. Rather than showing the team averages over time, it is giving me the same number (what I assume is the overall mean of each stat) for each day. Any ideas what is wrong with my loop? Thanks!
Answer 1:
If I understand correctly, the OP wants to compute the cumulative mean of some variables for each team and season "showing the team averages over time".
Although the OP uses the term "roll", e.g., roll_stats or roll_season_team_stats, his code suggests that he is not after a rolling mean but wants to compute cumulative means from the first DayNum on, e.g.:
stats <- stats_DT[DayNum < i]
However, cumulative means can be calculated directly without creating the result piecewise in a for loop or by lapply() and combining the pieces afterwards.
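To make the distinction between a rolling mean and a cumulative mean concrete, here is a minimal sketch. The vector x holds illustrative values only, and the window size of 2 is an arbitrary assumption; frollmean() is data.table's rolling-mean function.

```r
library(data.table)

# illustrative values only (e.g. field goals made in four consecutive games)
x <- c(27, 24, 23, 30)

# rolling mean over the 2 most recent games;
# the first element is NA because the window is not yet complete
roll2 <- frollmean(x, n = 2)

# cumulative mean: average of all games up to and including the current one
cum <- cumsum(x) / seq_along(x)

print(roll2)
print(cum)
```

The rolling mean forgets older games once they leave the window, while the cumulative mean keeps every game since the start, which is what the filter DayNum < i in the OP's loop implies.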
Unfortunately, the sample dataset provided by the OP does contain rows for many different teams but no history, i.e., no data for the same team for a number of consecutive days. Therefore, I have modified the sample dataset for demonstration:
# create new sample data set
stats_DT2 <- copy(stats_DT)[, TeamID := c(1:2, 1:4, 1:4)][]
stats_DT2
    Season DayNum TeamID FGM FGA FGM3 FGA3 FTM FTA OR DR TO
 1:   2003     10      1  27  58    3   14  11  18 14 24 23
 2:   2003     10      2  26  62    8   20  10  19 15 28 13
 3:   2003     11      1  24  58    8   18  17  29 17 26 10
 4:   2003     11      2  18  38    3    9  17  31  6 19 12
 5:   2003     11      3  30  61    6   14  11  13 17 22 14
 6:   2003     11      4  26  57    6   12  23  27 12 24  9
 7:   2003     12      1  23  55    2    8  32  39 13 18 17
 8:   2003     12      2  28  62    4   14  15  21 13 35 19
 9:   2003     12      3  28  58    5   11  10  18  9 22 17
10:   2003     12      4  32  67    5   17  15  19 14 22  6
Now, as there are 2 to 3 rows for each team, the cumulative means can be calculated by:
# define function for cumulative mean
cummean <- function(x) cumsum(x) / seq_along(x)
# define variables to compute on
cols <- c('FGM', 'FGA', 'FGM3', 'FGA3', 'FTM', 'FTA', 'OR', 'DR', 'TO')
# compute aggregates
stats_DT2[order(DayNum), c(.(DayNum = DayNum), lapply(.SD, cummean)),
          .SDcols = cols, by = .(TeamID, Season)][]
    TeamID Season DayNum   FGM  FGA  FGM3  FGA3  FTM   FTA    OR    DR    TO
 1:      1   2003     10 27.00 58.0 3.000 14.00 11.0 18.00 14.00 24.00 23.00
 2:      1   2003     11 25.50 58.0 5.500 16.00 14.0 23.50 15.50 25.00 16.50
 3:      1   2003     12 24.67 57.0 4.333 13.33 20.0 28.67 14.67 22.67 16.67
 4:      2   2003     10 26.00 62.0 8.000 20.00 10.0 19.00 15.00 28.00 13.00
 5:      2   2003     11 22.00 50.0 5.500 14.50 13.5 25.00 10.50 23.50 12.50
 6:      2   2003     12 24.00 54.0 5.000 14.33 14.0 23.67 11.33 27.33 14.67
 7:      3   2003     11 30.00 61.0 6.000 14.00 11.0 13.00 17.00 22.00 14.00
 8:      3   2003     12 29.00 59.5 5.500 12.50 10.5 15.50 13.00 22.00 15.50
 9:      4   2003     11 26.00 57.0 6.000 12.00 23.0 27.00 12.00 24.00  9.00
10:      4   2003     12 29.00 62.0 5.500 14.50 19.0 23.00 13.00 23.00  7.50
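As a sanity check on the cumsum(x) / seq_along(x) idea, the following sketch compares it with the piecewise "mean of the first k values" computation that a loop would perform. It uses team 1's FGM values from the modified sample data; the names fgm and piecewise are introduced here for illustration only.

```r
cummean <- function(x) cumsum(x) / seq_along(x)

# FGM of team 1 on days 10, 11 and 12 (taken from the modified sample data)
fgm <- c(27, 24, 23)

# piecewise version: mean of the first k values, for each k
piecewise <- sapply(seq_along(fgm), function(k) mean(fgm[seq_len(k)]))

# both approaches should agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(cummean(fgm), piecewise)))

round(cummean(fgm), 2)
```

One caveat, stated as an observation rather than part of the original answer: cumsum() propagates NA, so this helper does not behave like mean(..., na.rm = TRUE) when a stat is missing. Columns with missing values would need to be handled separately.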
Alternatively, the cumulative means can be appended:
# append cumulative columns
stats_DT2[order(DayNum), paste0("cm_", cols) := lapply(.SD, cummean),
          .SDcols = cols, by = .(TeamID, Season)][]
    Season DayNum TeamID FGM FGA FGM3 FGA3 FTM FTA OR DR TO cm_FGM cm_FGA cm_FGM3 cm_FGA3 cm_FTM cm_FTA cm_OR cm_DR cm_TO
 1:   2003     10      1  27  58    3   14  11  18 14 24 23  27.00   58.0   3.000   14.00   11.0  18.00 14.00 24.00 23.00
 2:   2003     10      2  26  62    8   20  10  19 15 28 13  26.00   62.0   8.000   20.00   10.0  19.00 15.00 28.00 13.00
 3:   2003     11      1  24  58    8   18  17  29 17 26 10  25.50   58.0   5.500   16.00   14.0  23.50 15.50 25.00 16.50
 4:   2003     11      2  18  38    3    9  17  31  6 19 12  22.00   50.0   5.500   14.50   13.5  25.00 10.50 23.50 12.50
 5:   2003     11      3  30  61    6   14  11  13 17 22 14  30.00   61.0   6.000   14.00   11.0  13.00 17.00 22.00 14.00
 6:   2003     11      4  26  57    6   12  23  27 12 24  9  26.00   57.0   6.000   12.00   23.0  27.00 12.00 24.00  9.00
 7:   2003     12      1  23  55    2    8  32  39 13 18 17  24.67   57.0   4.333   13.33   20.0  28.67 14.67 22.67 16.67
 8:   2003     12      2  28  62    4   14  15  21 13 35 19  24.00   54.0   5.000   14.33   14.0  23.67 11.33 27.33 14.67
 9:   2003     12      3  28  58    5   11  10  18  9 22 17  29.00   59.5   5.500   12.50   10.5  15.50 13.00 22.00 15.50
10:   2003     12      4  32  67    5   17  15  19 14 22  6  29.00   62.0   5.500   14.50   19.0  23.00 13.00 23.00  7.50
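Note that the cumulative means above include the game played on the current DayNum, whereas the OP's filter DayNum < i only looks at earlier games. If only prior games should count (an assumption about the OP's intent, not something the answer states), one possible variation is to lag the cumulative mean by one game with data.table's shift(). The table toy and the column prior_FGM are hypothetical names used only for this sketch.

```r
library(data.table)

cummean <- function(x) cumsum(x) / seq_along(x)

# toy table: team 1's FGM on three game days, as in the modified sample data
toy <- data.table(TeamID = 1L, Season = 2003L,
                  DayNum = c(10L, 11L, 12L),
                  FGM = c(27, 24, 23))

# sort so that cumulative values follow game order within each team and season
setorder(toy, TeamID, Season, DayNum)

# cumulative mean including the current game
toy[, cm_FGM := cummean(FGM), by = .(TeamID, Season)]

# lag by one row so each game only sees the average of earlier games;
# the first game of a team's season has no history and therefore gets NA
toy[, prior_FGM := shift(cm_FGM), by = .(TeamID, Season)]

print(toy)
```

This sketch assumes at most one game per team and DayNum; with several rows per day the lag would need to be defined per day rather than per row.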
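Answer 2:
Avoid growing objects in a loop, which leads to excessive copying in memory. Instead, build a list of data frames and row-bind them once outside the loop. Note also that the subset stats_DT[DayNum < i] is what gets passed to dcast here; the original loop created stats but then passed the full stats_DT to dcast, which is why every day showed the same overall means.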
dt_list <- lapply(0:max(stats_DT$DayNum), function(i)
  tryCatch(
    dcast(stats_DT[DayNum < i],
          TeamID + Season ~ ., fun.aggregate = mean, na.rm = TRUE,
          value.var = c('FGM', 'FGA', 'FGM3', 'FGA3',
                        'FTM', 'FTA', 'OR', 'DR', 'TO')
    )[, DayNum := i + 1],
    error = function(e) NULL)
)
roll_season_team_stats <- data.table::rbindlist(dt_list)
In fact, you may be able to do this in base R with aggregate on data frames:
stats_DF <- data.frame(stats_DT)
df_list <- lapply(0:max(stats_DT$DayNum), function(i)
  tryCatch(
    # TO added to the column list so it matches the dcast version above
    transform(aggregate(cbind(FGM, FGA, FGM3, FGA3,
                              FTM, FTA, OR, DR, TO) ~ TeamID + Season,
                        stats_DF[stats_DF$DayNum < i,],
                        FUN = mean,
                        na.rm = TRUE),
              DayNum = i + 1),
    error = function(e) NULL)
)
roll_season_team_stats <- do.call(rbind, df_list)
Source: https://stackoverflow.com/questions/60346592/how-do-i-make-my-for-loop-properly-calculate-means-over-time