Question
The DataFrame is shown below:
Name Job Salary
john painter 40000
peter engineer 50000
sam plumber 30000
john doctor 500000
john driver 20000
sam carpenter 10000
peter scientist 100000
How can I group by the column Name and apply min-max normalization to the Salary column within each group?
Expected result:
Name Job Salary
john painter 0.041666
peter engineer 0
sam plumber 1
john doctor 1
john driver 0
sam carpenter 0
peter scientist 1
I have tried the following:
data = df.groupby('Name').transform(lambda x: (x - x.min()) / x.max()- x.min())
However, this produces:
Salary
0 -19999.960000
1 -50000.000000
2 -9999.333333
3 -19999.040000
4 -20000.000000
5 -10000.000000
6 -49999.500000
Answer 1:
You are almost there.
>>> df
Name Job Salary
0 john painter 40000
1 peter engineer 50000
2 sam plumber 30000
3 john doctor 500000
4 john driver 20000
5 sam carpenter 10000
6 peter scientist 100000
>>>
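>>> # Select 'Salary' explicitly so the lambda is never applied to the string column 'Job'
>>> result = df.assign(Salary=df.groupby('Name')['Salary'].transform(lambda x: (x - x.min()) / (x.max() - x.min())))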
>>> # alternatively, df['Salary'] = df.groupby(... if you don't need a new frame
>>> result
Name Job Salary
0 john painter 0.041667
1 peter engineer 0.000000
2 sam plumber 1.000000
3 john doctor 1.000000
4 john driver 0.000000
5 sam carpenter 0.000000
6 peter scientist 1.000000
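So basically, you just forgot to enclose x.max() - x.min() in parentheses. Division binds more tightly than subtraction, so your expression was evaluated as ((x - x.min()) / x.max()) - x.min().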
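To see the effect of the parentheses on a single group, here is a minimal sketch using john's three salaries from the question. The variable names `wrong` and `right` are only for illustration.

```python
import pandas as pd

# john's salaries from the question: painter, doctor, driver
x = pd.Series([40000, 500000, 20000])

# Without parentheses: the division runs first, then x.min() is subtracted
wrong = (x - x.min()) / x.max() - x.min()

# With parentheses: divide by the range (max - min)
right = (x - x.min()) / (x.max() - x.min())

print(wrong.tolist())
print(right.round(6).tolist())
```

The values in `wrong` correspond to rows 0, 3 and 4 of the output in the question, and `right` gives the expected 0.041667, 1 and 0.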
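Note that this can be done much faster with a series of vectorized operations: compute each group's max and min once with the built-in 'max' and 'min' aggregations instead of calling a Python lambda for every group.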
>>> grouper = df.groupby('Name')['Salary']
>>> maxes = grouper.transform('max')
>>> mins = grouper.transform('min')
>>>
>>> result = df.assign(Salary=(df.Salary - mins)/(maxes - mins))
>>> result
Name Job Salary
0 john painter 0.041667
1 peter engineer 0.000000
2 sam plumber 1.000000
3 john doctor 1.000000
4 john driver 0.000000
5 sam carpenter 0.000000
6 peter scientist 1.000000
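One case the snippets above do not cover is a group whose salaries are all equal (for example, a name that appears only once): max - min is then 0 and the division produces NaN. Below is a sketch of one way to wrap the vectorized approach. The helper name `minmax_by_group`, the frame name `demo`, the extra row for 'amy', and the choice to map constant groups to 0.0 are all assumptions for illustration, not part of the original answer.

```python
import pandas as pd

def minmax_by_group(df, key, col):
    """Min-max scale df[col] within each group of df[key] (hypothetical helper)."""
    grouped = df.groupby(key)[col]
    mins = grouped.transform('min')
    span = grouped.transform('max') - mins
    # For constant groups span is 0 and 0/0 evaluates to NaN in pandas;
    # overwrite exactly those rows with 0.0 (an arbitrary convention).
    scaled = (df[col] - mins) / span
    return scaled.mask(span == 0, 0.0)

# Data from the question plus one hypothetical single-row group ('amy')
demo = pd.DataFrame({
    'Name': ['john', 'peter', 'sam', 'john', 'john', 'sam', 'peter', 'amy'],
    'Job': ['painter', 'engineer', 'plumber', 'doctor', 'driver',
            'carpenter', 'scientist', 'pilot'],
    'Salary': [40000, 50000, 30000, 500000, 20000, 10000, 100000, 70000],
})

demo_result = demo.assign(Salary=minmax_by_group(demo, 'Name', 'Salary'))
print(demo_result)
```

Using `mask` on `span == 0` rather than a blanket `fillna` leaves NaN in place for rows whose salary was missing to begin with, so only genuinely constant groups are affected.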
Timings:
>>> # Setup
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.concat([df]*1000, ignore_index=True)
>>> df.Name = np.arange(len(df)//4).repeat(4) # 4 rows per group
>>> df
Name Job Salary
0 0 painter 40000
1 0 engineer 50000
2 0 plumber 30000
3 0 doctor 500000
4 1 driver 20000
... ... ... ...
6995 1748 plumber 30000
6996 1749 doctor 500000
6997 1749 driver 20000
6998 1749 carpenter 10000
6999 1749 scientist 100000
[7000 rows x 3 columns]
>>>
>>> # Tests @ i5-6200U CPU @ 2.30GHz
>>> %timeit df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min()))
1.19 s ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
...: grouper = df.groupby('Name')['Salary']
...: maxes = grouper.transform('max')
...: mins = grouper.transform('min')
...: (df.Salary - mins)/(maxes - mins)
...:
...:
3.04 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Source: https://stackoverflow.com/questions/53961569/normalize-a-column-of-dataframe-using-min-max-normalization-based-on-groupby-of