Rolling average calculating some values it shouldn't?

我的未来我决定 submitted on 2021-01-29 14:00:02

Question


Following up on my question here, I was redirected to another thread and was able to adapt the code from that answer to get close to where I want to be. I'm running into a problem now, though, and I'm a bit confused about what is causing it.

My dataframe in essence looks as follows:

Date   HomeTeam   AwayTeam   HGoals   AGoals   HGRollA   AGRollA
1/1    AAA        BBB        4        2        2.67      1.67

Link to a more detailed image of said dataframe with some extra columns.

Basically, every row has:
- the date of the match
- the home and away teams
- the goals scored that day by the home and away teams
- the 2 columns I added, which hold the rolling average of goals scored by the home team and the away team over each team's last 3 matches, NOT including the current row. So in the above instance, team AAA scored on average 2.67 goals in their last 3 matches (home OR away) PRIOR to beating team BBB 4-2 that day.
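To make the intended calculation concrete, here is a minimal sketch for a single team. The goal values are made up for illustration (they are not taken from the dataframe in the post), and the order of operations shown (shift first, then roll) is one way to express "previous 3 matches, excluding the current one", not necessarily the code used in the post.

```python
import pandas as pd

# Hypothetical goals scored by one team in five consecutive matches (oldest first).
goals = pd.Series([3, 2, 3, 4, 1])

# shift(1) moves every value down one row, so each row only sees earlier matches;
# rolling(3, min_periods=3) then averages the three matches before the current one.
prior_avg = goals.shift(1).rolling(3, min_periods=3).mean().round(2)

print(prior_avg.tolist())
```

The first three rows come out as NaN because fewer than three earlier matches exist, and the fourth row (the 4-goal match) gets the mean of 3, 2 and 3, i.e. 2.67.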

The code I used to calculate the rolling average is as follows:

dfrollavg = (df[['HGoals','AGoals']]
            .stack()
            .groupby(df[['HomeTeam','AwayTeam']].stack().values)
            .rolling(3, min_periods = 3).mean().shift(1)
            .reset_index(level=0, drop=True)
            .unstack()
            .add_prefix('Avg_'))

That gives me a dataframe with just the rolling averages and no other info, so I put those columns back into the original dataframe to give me my desired result.

df['HGRollA'] = dfrollavg['Avg_HGoals'].round(2)
df['AGRollA'] = dfrollavg['Avg_AGoals'].round(2)

Now, here are the 2 problems this code is causing me.

  1. shift(1) is in there because I want the rolling average to cover the last 3 matches, NOT the last 2 matches + the current row. However, one weird thing is that the shift is bringing values into the first 10 rows of the dataframe, which should not be happening, and I'm not sure why. The first 30 rows or so should all be NaN, because no team has 3 observations until roughly that point. For some reason, though, shift(1) puts values into the first 10 rows (but not the next 20). If I change it to shift(0), this goes away... but then the rolling average covers the past 2 games + the current row instead of the previous 3 games as I want.

  2. I have multiple seasons in this dataframe. At the start of each new season there will always be 3 new teams in the dataframe that have no previous games. So if on the first day of season 2011 team AAA plays team CCC, and team CCC was not in the league last season (2010, the first year of the dataset), then team CCC shouldn't have a rolling average and should be NaN until they have played 3 games in the dataset. Team AAA was in the league last season, so it's fine for them to have a rolling average. For some reason, though, my code assigns team CCC a rolling average right away.

If I had to guess, I'd say it's either my code calculating the rolling average that is messing up somehow, or perhaps the step where I insert the results back in as columns in the original dataframe?
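One possible explanation for both symptoms, offered as a sketch on made-up data rather than a confirmed diagnosis: `.rolling(...).mean()` on a groupby returns a single Series with all teams stacked one after another, so a `.shift(1)` chained after it moves values across team boundaries. Each team's first row would then inherit the last rolling mean of the team sorted just before it. The toy fixtures and the names `leaky` and `per_team` below are hypothetical; the second variant shifts inside each group via `transform`, which is one possible way to keep teams separate.

```python
import pandas as pd

# Hypothetical toy fixtures (not from the original post): two teams, five matches.
df = pd.DataFrame({
    "HomeTeam": ["AAA", "BBB", "AAA", "BBB", "AAA"],
    "AwayTeam": ["BBB", "AAA", "BBB", "AAA", "BBB"],
    "HGoals":   [1, 2, 3, 0, 2],
    "AGoals":   [0, 3, 1, 2, 1],
})

# Long format, as in the question: one goals value and one team label per team per match.
goals = df[["HGoals", "AGoals"]].stack()
teams = df[["HomeTeam", "AwayTeam"]].stack().values

# Pattern from the question: shift(1) runs on the combined result, across team boundaries.
leaky = (goals.groupby(teams)
              .rolling(3, min_periods=3).mean().shift(1)
              .reset_index(level=0, drop=True)
              .unstack()
              .add_prefix("Avg_"))

# Possible alternative: shift inside each team's group, then take the rolling mean.
per_team = (goals.groupby(teams)
                 .transform(lambda s: s.shift(1).rolling(3, min_periods=3).mean())
                 .unstack()
                 .add_prefix("Avg_"))

print(leaky.round(2))
print(per_team.round(2))
```

On this toy data the two frames differ only in row 0: with the question's pattern, BBB's first match (`Avg_AGoals`) picks up AAA's final rolling mean, while the per-group variant leaves it as NaN until BBB has three earlier matches.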

Source: https://stackoverflow.com/questions/61426946/rolling-average-calculating-some-values-it-shouldnt
