Normalize DataFrame by group

梦谈多话 2021-02-07 02:04

Let's say that I have some data generated as follows:

import numpy as np

N = 20
m = 3
data = np.random.normal(size=(N,m)) + np.random.normal(size=(N,m))**3

and t

4 answers
  •  南旧 (original poster)
    2021-02-07 02:35

    The accepted answer works and is elegant. Unfortunately, on large datasets I think .transform() is much slower than the less elegant approach below (illustrated with a single column 'a0'):

    means_stds = df.groupby('indx')['a0'].agg(['mean','std']).reset_index()
    df = df.merge(means_stds,on='indx')
    df['a0_normalized'] = (df['a0'] - df['mean']) / df['std']
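
As a sanity check, the merge-based result can be compared against a .transform() version. The sketch below is only an illustration: the random data, the lambda passed to .transform(), and the use of how="left" (to keep the original row order) are assumptions, not taken from the accepted answer.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 'indx' is the group key, 'a0' the column to normalize.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "indx": rng.integers(0, 4, size=200),
    "a0": rng.normal(size=200),
})

# A .transform() version: the z-score is computed once per group.
via_transform = df.groupby("indx")["a0"].transform(lambda x: (x - x.mean()) / x.std())

# The merge version. how="left" keeps the rows in their original order,
# so the two results can be compared position by position.
means_stds = df.groupby("indx")["a0"].agg(["mean", "std"]).reset_index()
merged = df.merge(means_stds, on="indx", how="left")
via_merge = (merged["a0"] - merged["mean"]) / merged["std"]

# True when both approaches agree up to floating-point error.
same = np.allclose(via_transform.to_numpy(), via_merge.to_numpy())
```

Both versions use the sample standard deviation (ddof=1), which is the pandas default for std.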
    

    To do it for multiple columns you'll have to work out the merge. My suggestion is to flatten the MultiIndex columns produced by the aggregation, as in this answer, and then merge and normalize each column separately:

    means_stds = df.groupby('indx')[['a0','a1']].agg(['mean','std']).reset_index()
    means_stds.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in means_stds.columns]
    df = df.merge(means_stds,on='indx')
    for col in ['a0','a1']:
        df[col+'_normalized'] = ( df[col] - df[col+'|mean'] ) / df[col+'|std']
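
If this is needed in more than one place, the same steps could be wrapped in a small function. This is a sketch under assumptions: normalize_by_group is a hypothetical name, and it uses DataFrame.join on the group key instead of merge so that the original index and row order are preserved and the helper mean/std columns do not end up in the result.

```python
import numpy as np
import pandas as pd

def normalize_by_group(df, key, cols):
    """Return a copy of df with '<col>_normalized' columns added (hypothetical helper)."""
    stats = df.groupby(key)[cols].agg(["mean", "std"])
    # Flatten the ('a0', 'mean') style MultiIndex columns to 'a0|mean'.
    stats.columns = ["%s|%s" % (col, stat) for col, stat in stats.columns]
    # Join the per-group statistics onto each row via the key column;
    # join defaults to how="left" and keeps df's index.
    joined = df.join(stats, on=key)
    out = df.copy()
    for col in cols:
        out[col + "_normalized"] = (joined[col] - joined[col + "|mean"]) / joined[col + "|std"]
    return out

# Usage with made-up data shaped like the question's (a group key plus numeric columns).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "indx": rng.integers(0, 3, size=90),
    "a0": rng.normal(size=90),
    "a1": rng.normal(size=90) ** 3,
})
result = normalize_by_group(df, "indx", ["a0", "a1"])
```

After normalization, each group should have mean close to 0 and standard deviation close to 1 in every normalized column.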
    
