问题
Trying to create a new column from the groupby calculation. In the code below, I get the correct calculated values for each date (see group below) but when I try to create a new column (df[\'Data4\']) with it I get NaN. So I am trying to create a new column in the dataframe with the sum of Data3 for the all dates and apply that to each date row. For example, 2015-05-08 is in 2 rows (total is 50+5 = 55) and in this new column I would like to have 55 in both of the rows.
import pandas as pd
import numpy as np
from pandas import DataFrame
df = pd.DataFrame({\'Date\': [\'2015-05-08\', \'2015-05-07\', \'2015-05-06\', \'2015-05-05\', \'2015-05-08\', \'2015-05-07\', \'2015-05-06\', \'2015-05-05\'], \'Sym\': [\'aapl\', \'aapl\', \'aapl\', \'aapl\', \'aaww\', \'aaww\', \'aaww\', \'aaww\'], \'Data2\': [11, 8, 10, 15, 110, 60, 100, 40],\'Data3\': [5, 8, 6, 1, 50, 100, 60, 120]})
group = df[\'Data3\'].groupby(df[\'Date\']).sum()
df[\'Data4\'] = group
回答1:
You want to use transform this will return a Series with the index aligned to the df so you can then add it as a new column:
In [74]:
df = pd.DataFrame({'Date': ['2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05', '2015-05-08', '2015-05-07', '2015-05-06', '2015-05-05'], 'Sym': ['aapl', 'aapl', 'aapl', 'aapl', 'aaww', 'aaww', 'aaww', 'aaww'], 'Data2': [11, 8, 10, 15, 110, 60, 100, 40],'Data3': [5, 8, 6, 1, 50, 100, 60, 120]})
df['Data4'] = df['Data3'].groupby(df['Date']).transform('sum')
df
Out[74]:
Data2 Data3 Date Sym Data4
0 11 5 2015-05-08 aapl 55
1 8 8 2015-05-07 aapl 108
2 10 6 2015-05-06 aapl 66
3 15 1 2015-05-05 aapl 121
4 110 50 2015-05-08 aaww 55
5 60 100 2015-05-07 aaww 108
6 100 60 2015-05-06 aaww 66
7 40 120 2015-05-05 aaww 121
回答2:
How do I create a new column with Groupby().Sum()?
There are two ways - one straightforward and the other slightly more interesting.
Everybody's Favorite: GroupBy.transform() with 'sum'
@Ed Chum's answer can be simplified, a bit. Call DataFrame.groupby rather than Series.groupby. This results in simpler syntax.
# The setup.
df[['Date', 'Data3']]
Date Data3
0 2015-05-08 5
1 2015-05-07 8
2 2015-05-06 6
3 2015-05-05 1
4 2015-05-08 50
5 2015-05-07 100
6 2015-05-06 60
7 2015-05-05 120
df.groupby('Date')['Data3'].transform('sum')
0 55
1 108
2 66
3 121
4 55
5 108
6 66
7 121
Name: Data3, dtype: int64
It's a tad faster,
df2 = pd.concat([df] * 12345)
%timeit df2['Data3'].groupby(df['Date']).transform('sum')
%timeit df2.groupby('Date')['Data3'].transform('sum')
10.4 ms ± 367 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.58 ms ± 559 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Unconventional, but Worth your Consideration: GroupBy.sum() + Series.map()
I stumbled upon an interesting idiosyncrasy in the API. From what I tell, you can reproduce this on any major version over 0.20 (I tested this on 0.23 and 0.24). It seems like you consistently can shave off a few milliseconds of the time taken by transform if you instead use a direct function of GroupBy and broadcast it using map:
df.Date.map(df.groupby('Date')['Data3'].sum())
0 55
1 108
2 66
3 121
4 55
5 108
6 66
7 121
Name: Date, dtype: int64
Compare with
df.groupby('Date')['Data3'].transform('sum')
0 55
1 108
2 66
3 121
4 55
5 108
6 66
7 121
Name: Data3, dtype: int64
My tests show that map is a bit faster if you can afford to use the direct GroupBy function (such as mean, min, max, first, etc). It is more or less faster for most general situations upto around ~200 thousand records. After that, the performance really depends on the data.
(Left: v0.23, Right: v0.24)
Nice alternative to know, and better if you have smaller frames with smaller numbers of groups. . . but I would recommend transform as a first choice. Thought this was worth sharing anyway.
Benchmarking code, for reference:
import perfplot
perfplot.show(
setup=lambda n: pd.DataFrame({'A': np.random.choice(n//10, n), 'B': np.ones(n)}),
kernels=[
lambda df: df.groupby('A')['B'].transform('sum'),
lambda df: df.A.map(df.groupby('A')['B'].sum()),
],
labels=['GroupBy.transform', 'GroupBy.sum + map'],
n_range=[2**k for k in range(5, 20)],
xlabel='N',
logy=True,
logx=True
)
来源:https://stackoverflow.com/questions/30244952/how-do-i-create-a-new-column-from-the-output-of-pandas-groupby-sum