Let's say that I have some data generated as follows:
import numpy as np
N = 20
m = 3
data = np.random.normal(size=(N,m)) + np.random.normal(size=(N,m))**3
and t
The accepted answer works and is elegant. Unfortunately, for large datasets I think .transform() is much slower than the following, less elegant approach (illustrated with a single column 'a0'):
means_stds = df.groupby('indx')['a0'].agg(['mean','std']).reset_index()
df = df.merge(means_stds,on='indx')
df['a0_normalized'] = (df['a0'] - df['mean']) / df['std']
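If you want to check the speed difference on your own machine, here is a minimal sketch that runs both approaches on synthetic data and confirms they give the same numbers. The row count, group count, and seed are made up for illustration; only the column names 'indx' and 'a0' come from the code above. I pass how="left" to the merge as an assumption on my part, so the rows stay in the original order and the two results can be compared position by position.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data: sizes and seed are assumptions, not from the question.
rng = np.random.default_rng(0)
n_rows, n_groups = 50_000, 2_000
df = pd.DataFrame({
    "indx": rng.integers(0, n_groups, size=n_rows),
    "a0": rng.normal(size=n_rows),
})

# Approach 1: a Python lambda called once per group via .transform()
t0 = time.perf_counter()
via_transform = df.groupby("indx")["a0"].transform(
    lambda x: (x - x.mean()) / x.std()
)
t1 = time.perf_counter()

# Approach 2: aggregate once, merge the statistics back, normalize vectorized
means_stds = df.groupby("indx")["a0"].agg(["mean", "std"]).reset_index()
merged = df.merge(means_stds, on="indx", how="left")  # left join keeps row order
via_merge = (merged["a0"] - merged["mean"]) / merged["std"]
t2 = time.perf_counter()

print(f"transform: {t1 - t0:.3f}s, agg+merge: {t2 - t1:.3f}s")

# Both use the sample standard deviation (ddof=1), so the values should agree.
same = np.allclose(via_transform.to_numpy(), via_merge.to_numpy(), equal_nan=True)
print("results match:", same)
```

The timings depend heavily on the number of groups, since the lambda runs once per group, so try it with a group count close to your real data.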
To do it for multiple columns, you'll have to figure out the merge. My suggestion is to flatten the MultiIndex columns produced by the aggregation, as in this answer, and then merge and normalize each column separately:
means_stds = df.groupby('indx')[['a0','a1']].agg(['mean','std']).reset_index()
means_stds.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in means_stds.columns]
df = df.merge(means_stds,on='indx')
for col in ['a0','a1']:
    df[col+'_normalized'] = ( df[col] - df[col+'|mean'] ) / df[col+'|std']
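For reference, the multi-column steps can be wrapped into one function. This is only a sketch: normalize_by_group is a hypothetical helper name, the 'indx' grouping (four groups of five rows) is invented so that every group has a defined standard deviation, and the how="left" merge and the final drop of the helper columns are my own choices rather than part of the recipe above.

```python
import numpy as np
import pandas as pd


def normalize_by_group(df, key, cols):
    """Hypothetical helper: z-score each column in `cols` within groups of `key`."""
    stats = df.groupby(key)[cols].agg(["mean", "std"]).reset_index()
    # Flatten ('a0', 'mean') -> 'a0|mean'; the key column ('indx', '') stays 'indx'.
    stats.columns = [f"{a}|{b}" if b else a for a, b in stats.columns]
    out = df.merge(stats, on=key, how="left")  # left join keeps row order
    for col in cols:
        out[col + "_normalized"] = (out[col] - out[col + "|mean"]) / out[col + "|std"]
    # Remove the per-group statistics once they have been used.
    return out.drop(columns=[c for c in stats.columns if c != key])


# Data shaped like the question's; the grouping column is an assumption.
rng = np.random.default_rng(0)
N, m = 20, 3
data = rng.normal(size=(N, m)) + rng.normal(size=(N, m)) ** 3
df = pd.DataFrame(data, columns=[f"a{i}" for i in range(m)])
df["indx"] = np.repeat(np.arange(4), 5)

out = normalize_by_group(df, "indx", ["a0", "a1"])
print(out.head())
```

Within each group, the normalized columns should come out with mean close to 0 and sample standard deviation close to 1, which is a quick sanity check on real data too.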