Question
I have a dataframe like this:
col1 col2
0 a 100
1 a 200
2 a 150
3 b 1000
4 c 400
5 c 200
I want to group by col1 and count the rows in each group. If the count is 2 or more, calculate the mean of col2 for those rows; otherwise return null. The output should be:
col1 mean
0 a 150
1 b
2 c 300
Answer 1:
Use groupby.mean + DataFrame.where with Series.value_counts:
df.groupby('col1').mean().where(df['col1'].value_counts().ge(2)).reset_index()
# you can also select only the columns you want:
#(df.groupby('col1')[['col2']]
# .mean()
# .where(df['col1'].value_counts().ge(2)).reset_index())
Output
col1 col2
0 a 150.0
1 b NaN
2 c 300.0
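As a self-contained sketch of the same idea, the snippet below rebuilds the DataFrame from the question and splits the one-liner into steps. The intermediate names (means, counts, keep) are only illustrative and not part of the original answer. The point it shows is that where aligns the boolean Series with the group means by index label, so the different ordering of value_counts does not matter.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "c", "c"],
    "col2": [100, 200, 150, 1000, 400, 200],
})

# Mean of col2 per group, indexed by the col1 label.
means = df.groupby("col1")["col2"].mean()

# Number of rows per col1 label (ordered by count, not by label).
counts = df["col1"].value_counts()

# True for labels that occur at least twice.
keep = counts.ge(2)

# where() aligns on the index and replaces the means of rejected groups with NaN.
result = means.where(keep).rename("mean").reset_index()
print(result)
```

Renaming the Series before reset_index gives the col1 / mean column names requested in the question.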
If you really want blanks instead of NaN:
df.groupby('col1').mean().where(df['col1'].value_counts().ge(2), '').reset_index()
col1 col2
0 a 150
1 b
2 c 300
Answer 2:
Custom agg function:
import numpy as np
df.groupby('col1').agg(lambda d: np.nan if len(d) == 1 else d.mean())
col2
col1
a 150.0
b NaN
c 300.0
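A runnable variant of this approach might look like the sketch below. The helper name mean_if_repeated and its min_count parameter are hypothetical, introduced only to make the threshold explicit. It tests len(values) >= min_count rather than len(d) == 1, which matches the "2 or more" wording of the question.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "c", "c"],
    "col2": [100, 200, 150, 1000, 400, 200],
})

def mean_if_repeated(values, min_count=2):
    """Return the group mean, or NaN when the group has fewer than min_count rows.

    Hypothetical helper; the name and the min_count default are assumptions.
    """
    return values.mean() if len(values) >= min_count else np.nan

# agg calls the helper once per group, passing that group's col2 values.
result = df.groupby("col1")["col2"].agg(mean_if_repeated)
print(result)
```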
Answer 3:
I'd go with GroupBy and mask:
g = df.groupby('col1')
g.mean().mask(g.size().eq(1))
col2
col1
a 150.0
b NaN
c 300.0
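For reference, here is a self-contained sketch of the mask approach on the data from the question. It uses lt(2) instead of eq(1) for the condition, which is a small variation and not what the answer wrote, and it selects col2 explicitly so that only that column is averaged.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "c", "c"],
    "col2": [100, 200, 150, 1000, 400, 200],
})

# Reuse one GroupBy object for both the mean and the group sizes.
g = df.groupby("col1")["col2"]

# mask() sets the mean to NaN wherever the condition is True,
# i.e. for groups with fewer than two rows.
result = g.mean().mask(g.size().lt(2))
print(result)
```

mask is the inverse of where: it blanks the entries where the condition holds, so the condition describes the groups to reject rather than the groups to keep.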
Answer 4:
import numpy as np
df.groupby('col1')['col2'].apply(lambda x: x.mean() if x.count() >= 2 else np.nan)
col1
a 150.0
b NaN
c 300.0
Edit: timings on the sample DataFrame:
%timeit df.groupby('col1')['col2'].apply(lambda x: x.mean() if x.count() >= 2 else np.nan)
2.36 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# piRSquared (Answer 2)
%timeit df.groupby('col1').agg(lambda d: np.nan if len(d) == 1 else d.mean())
5.9 ms ± 30 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ansev (Answer 1)
%timeit df.groupby('col1').mean().where(df['col1'].value_counts().ge(2)).reset_index()
7.01 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
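The timings above come from a six-row DataFrame, so they mostly measure fixed overhead. To check how the approaches compare on a larger input, something like the following sketch could be used outside IPython. The synthetic data, its size, and the function names with_apply and with_mask are all assumptions made for illustration; actual numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption, not from the thread): 5,000 rows spread over
# up to 4,000 keys, so a fair share of the groups contain a single row.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "col1": rng.integers(0, 4_000, size=5_000),
    "col2": rng.random(5_000),
})

def with_apply():
    # Per-group Python lambda, as in Answer 4.
    return big.groupby("col1")["col2"].apply(
        lambda x: x.mean() if x.count() >= 2 else np.nan
    )

def with_mask():
    # Built-in group mean, then NaN for groups with fewer than two rows.
    g = big.groupby("col1")["col2"]
    return g.mean().mask(g.size().lt(2))

# Both approaches should agree before their speed is compared.
assert np.allclose(with_apply(), with_mask(), equal_nan=True)

for fn in (with_apply, with_mask):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```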
Source: https://stackoverflow.com/questions/60325296/pandas-groupby-count-and-then-conditional-mean