Normalize DataFrame by group


Let's say that I have some data generated as follows:

import numpy as np

N = 20
m = 3
data = np.random.normal(size=(N, m)) + np.random.normal(size=(N, m))**3

and that each row is assigned to a group via an index column. I want to normalize each column within its group, i.e. subtract the group mean and divide by the group standard deviation. What is an efficient way to do this?

4 Answers

    If the data contains many groups (thousands or more), the accepted answer may take a very long time to compute.

    Even though groupby.transform itself is fast, as are the already vectorized calls inside the lambda (.mean(), .std() and the subtraction), calling a pure-Python function for each group adds considerable overhead.

    This can be avoided by using purely vectorized Pandas/NumPy calls and not writing any per-group Python function, as shown in ErnestScribbler's answer.
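
    For reference, the per-group lambda approach from the accepted answer (also benchmarked below) is:

    df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())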

    We can get around the headache of merging and naming the columns by leveraging the broadcasting abilities of .transform:

    def normalize_by_group(df, by):
        groups = df.groupby(by)
        # Compute the group-wise mean/std; .transform broadcasts each
        # statistic back to the shape of its group's chunk.
        mean = groups.transform('mean')
        std = groups.transform('std')  # sample std (ddof=1), matching DataFrame.std()
        return (df[mean.columns] - mean) / std
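
    A quick usage sketch (the tiny frame below, with data column 'a' and group column 'g', is made up for illustration):

    import pandas as pd

    df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                       'g': [0, 0, 1, 1]})
    print(normalize_by_group(df, 'g'))
    #           a
    # 0 -0.707107
    # 1  0.707107
    # 2 -0.707107
    # 3  0.707107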
    

    For benchmarking I changed the data generation from the original question to allow for more groups:

    def gen_data(N, num_groups):
        m = 3
        data = np.random.normal(size=(N, m)) + np.random.normal(size=(N, m))**3
        indx = np.random.randint(0, num_groups, size=N).astype(np.int32)

        # group labels are stored alongside the data (as floats, via hstack)
        df = pd.DataFrame(np.hstack((data, indx[:, None])),
                          columns=['a%s' % k for k in range(m)] + ['indx'])
        return df
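
    For instance, a quick look at the generated frame (sizes here are just for illustration):

    df = gen_data(10000, 2)
    print(df.shape)  # (10000, 4): columns a0, a1, a2 plus the group label indx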
    

    With only two groups (and thus only a handful of Python function calls), the lambda version is only about 1.8x slower than the vectorized code:

    In: df2g = gen_data(10000, 2)  # 3 cols, 10000 rows, 2 groups
    
    In: %timeit normalize_by_group(df2g, "indx")
    6.61 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In: %timeit df2g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
    12.3 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    When the number of groups is increased to 1000, the runtime issue becomes apparent: the lambda version is about 370x slower than the vectorized code:

    In: df1000g = gen_data(10000, 1000)  # 3 cols, 10000 rows, 1000 groups
    
    In: %timeit normalize_by_group(df1000g, "indx")
    7.5 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In: %timeit df1000g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
    2.78 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
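
    As a final sanity check, the two versions compute the same result, since transform('std') and x.std() both use the sample standard deviation (ddof=1). A quick sketch reusing gen_data and normalize_by_group from above:

    import pandas as pd

    df = gen_data(1000, 50)
    fast = normalize_by_group(df, 'indx')
    slow = df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
    pd.testing.assert_frame_equal(fast, slow)  # equal up to floating-point noise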
    
