Normalize DataFrame by group


Let's say that I have some data generated as follows:

import numpy as np

N = 20
m = 3
data = np.random.normal(size=(N, m)) + np.random.normal(size=(N, m))**3

and that each row is assigned to a group via an index column. I want to normalize each column within its group, i.e. subtract the group mean and divide by the group standard deviation. What is an efficient way to do this?

4 Answers

    If the data contains many groups (thousands or more), the accepted answer may take a very long time to compute.

    Even though groupby.transform itself is fast, as are the already vectorized calls inside the lambda (.mean(), .std() and the subtraction), calling a pure-Python function for each group adds considerable overhead.

    This can be avoided by using purely vectorized Pandas/NumPy calls and not writing any per-group Python function, as shown in ErnestScribbler's answer.
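
    For reference, the per-group lambda approach from the accepted answer (also benchmarked below) is:

    df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())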

    We can get around the headache of merging and naming the columns by leveraging the broadcasting abilities of .transform:

    def normalize_by_group(df, by):
        groups = df.groupby(by)
        # Compute the group-wise mean/std; .transform broadcasts each
        # statistic back to the shape of its group's chunk.
        mean = groups.transform('mean')
        std = groups.transform('std')  # sample std (ddof=1), matching DataFrame.std()
        return (df[mean.columns] - mean) / std
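
    A quick usage sketch (the tiny frame below, with data column 'a' and group column 'g', is made up for illustration):

    import pandas as pd

    df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                       'g': [0, 0, 1, 1]})
    print(normalize_by_group(df, 'g'))
    #           a
    # 0 -0.707107
    # 1  0.707107
    # 2 -0.707107
    # 3  0.707107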
    

    For benchmarking I changed the data generation from the original question to allow for more groups:

    def gen_data(N, num_groups):
        m = 3
        data = np.random.normal(size=(N, m)) + np.random.normal(size=(N, m))**3
        indx = np.random.randint(0, num_groups, size=N).astype(np.int32)

        # group labels are stored alongside the data (as floats, via hstack)
        df = pd.DataFrame(np.hstack((data, indx[:, None])),
                          columns=['a%s' % k for k in range(m)] + ['indx'])
        return df
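
    For instance, a quick look at the generated frame (sizes here are just for illustration):

    df = gen_data(10000, 2)
    print(df.shape)  # (10000, 4): columns a0, a1, a2 plus the group label indx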
    

    With only two groups (and thus only a handful of Python function calls), the lambda version is only about 1.8x slower than the vectorized code:

    In: df2g = gen_data(10000, 2)  # 3 cols, 10000 rows, 2 groups
    
    In: %timeit normalize_by_group(df2g, "indx")
    6.61 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In: %timeit df2g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
    12.3 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    When the number of groups is increased to 1000, the runtime issue becomes apparent: the lambda version is about 370x slower than the vectorized code:

    In: df1000g = gen_data(10000, 1000)  # 3 cols, 10000 rows, 1000 groups
    
    In: %timeit normalize_by_group(df1000g, "indx")
    7.5 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In: %timeit df1000g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
    2.78 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
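
    As a final sanity check, the two versions compute the same result, since transform('std') and x.std() both use the sample standard deviation (ddof=1). A quick sketch reusing gen_data and normalize_by_group from above:

    import pandas as pd

    df = gen_data(1000, 50)
    fast = normalize_by_group(df, 'indx')
    slow = df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
    pd.testing.assert_frame_equal(fast, slow)  # equal up to floating-point noise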
    
