Question
I want to calculate an expanding z-score for some time series data that I have in a DataFrame, but I want to standardize the data using the mean and standard deviation of multiple columns, rather than the mean and standard deviation within each column separately. I believe that I want to use some combination of groupby and DataFrame.expanding but I can't seem to figure it out. Here's some example data:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(np.random.rand(5, 5),
                  columns=list('ABCDE'),
                  index=pd.date_range('2016-12-31', periods=5))
df.index.name = 'DATE'
df
Input: (table not preserved)
Desired output: (table not preserved)
I have dates down the rows and the data series as individual columns. What I want is a new DataFrame of the same shape where I've calculated the expanding Z-Score. What I can't figure out is how to get the df.expanding(2).mean() method to aggregate across multiple columns. That is to say, rather than taking the expanding mean of column A and subtracting that from the value in column A, I want to take the expanding mean of the values in columns A through E and subtract that mean from the value in A.
If you think in terms of Excel, what I'm talking about is the difference between =AVERAGE(B$2:B3) and =AVERAGE($B$2:$F3). To do the former is incredibly simple (df.expanding(2).mean()) but I can't figure out how to do the latter for the life of me.
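To make the distinction concrete, here is a small sketch using the example data above. The per-column version is the one-liner already mentioned; the pooled version is written out by hand for the second row only, just to pin down the target value. The names per_column and pooled_row2 are made up for this illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.rand(5, 5),
                  columns=list('ABCDE'),
                  index=pd.date_range('2016-12-31', periods=5))
df.index.name = 'DATE'

# Like =AVERAGE(B$2:B3): expanding mean within each column separately.
per_column = df.expanding(2).mean()

# Like =AVERAGE($B$2:$F3): a single pooled mean over every column,
# using the first two rows. Computed by hand for one row only; what is
# wanted is this value for every row without writing a loop.
pooled_row2 = df.iloc[:2].to_numpy().mean()
```

Here per_column has the same shape as df (one mean per column per date), while pooled_row2 is a single number that every column in the second row should be compared against.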
I've experimented a lot with various combinations of groupby, stack(), and expanding() to no avail.
Answer 1:
This is my own attempt at trying to calculate the expanding Z-Scores pooling all of the columns. Comments on how to do it more efficiently would be welcome.
def pooled_expanding_zscore(df, min_periods=2):
    """Calculates an expanding Z-Score down the rows of the DataFrame while pooling all of the columns.

    Assumes that indexes are not hierarchical.
    Assumes that the index has a name (df.index.name is not None).
    Assumes that df does not have columns named 'exp_mean' and 'exp_std'.
    """
    # Get last sorted column name
    # (np.sort returns a sorted copy, so df.columns is not modified in place)
    colNames = np.sort(df.columns.values)
    lastCol = colNames[-1]
    # Index name
    indexName = df.index.name
    # Normalize DataFrame: one row per (date, column) pair, ordered by date then column
    df_stacked = pd.melt(df.reset_index(), id_vars=indexName).sort_values(by=[indexName, 'variable'])
    # Calculates the expanding mean and standard deviation on df_stacked
    # Keeps just the rows where 'variable'==lastCol, i.e. the last value of each date,
    # where the expanding window covers every column up to and including that date
    df_exp = df_stacked['value'].expanding(2)
    df_stacked.loc[:, 'exp_mean'] = df_exp.mean()
    df_stacked.loc[:, 'exp_std'] = df_exp.std()
    exp_stats = (df_stacked.loc[df_stacked.variable == lastCol, :]
                 .reset_index()
                 .drop(['index', 'variable', 'value'], axis=1)
                 .set_index(indexName))
    # add exp_mean and exp_std back to df
    df = pd.concat([df, exp_stats], axis=1)
    # Calculate Z-Score
    # (to_numpy() replaces as_matrix(), which was removed in pandas 1.0)
    df_mat = df.loc[:, colNames].to_numpy()
    exp_mean_mat = df.loc[:, 'exp_mean'].to_numpy()[:, np.newaxis]
    exp_std_mat = df.loc[:, 'exp_std'].to_numpy()[:, np.newaxis]
    zScores = pd.DataFrame(
        (df_mat - exp_mean_mat) / exp_std_mat,
        index=df.index,
        columns=colNames)
    # Use min_periods to kill off early rows
    zScores.iloc[:min_periods - 1, :] = np.nan
    return zScores
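On the efficiency question, one possible alternative (a sketch, not tested beyond small examples) is to skip the melt/concat round trip and build the pooled statistics from cumulative sums of per-row totals. The function name pooled_expanding_zscore_cumsum is made up for this illustration. It assumes the sample standard deviation (ddof=1), which is what the pandas std() call above uses by default, and it keeps the original column order rather than sorting the columns.

```python
import numpy as np
import pandas as pd

def pooled_expanding_zscore_cumsum(df, min_periods=2):
    """Expanding z-score down the rows, pooling all columns (illustrative sketch)."""
    n = df.count(axis=1).cumsum()            # pooled number of observations so far
    s1 = df.sum(axis=1).cumsum()             # pooled running sum
    s2 = (df ** 2).sum(axis=1).cumsum()      # pooled running sum of squares
    mean = s1 / n
    # Sample variance (ddof=1) from the running sums; clip guards against
    # tiny negative values caused by floating-point round-off.
    var = ((s2 - n * mean ** 2) / (n - 1)).clip(lower=0)
    std = np.sqrt(var)
    # Subtract and divide row-wise: each row uses its own pooled mean and std.
    z = df.sub(mean, axis=0).div(std, axis=0)
    # Blank out the first min_periods - 1 rows, as in the function above.
    z.iloc[:min_periods - 1, :] = np.nan
    return z

np.random.seed(42)
df = pd.DataFrame(np.random.rand(5, 5),
                  columns=list('ABCDE'),
                  index=pd.date_range('2016-12-31', periods=5))
df.index.name = 'DATE'
z = pooled_expanding_zscore_cumsum(df)
```

One caveat: the sum-of-squares formula for the variance can lose precision when the mean is very large compared with the spread of the data, so for data like that the stacked expanding() approach may be the safer choice.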
Source: https://stackoverflow.com/questions/45044764/pandas-expanding-z-score-across-multiple-columns