pandas groupby count and then conditional mean

Question

I have a dataframe like this:

    col1 col2
0    a   100
1    a   200
2    a   150
3    b   1000
4    c   400
5    c   200

What I want to do is group by col1 and count the number of occurrences. If the count is greater than or equal to 2, calculate the mean of col2 for those rows; otherwise return null. The output should be:

    col1 mean
0    a   150
1    b   
2    c   300
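
To follow along with the answers, the sample frame can be rebuilt roughly as follows. This is a minimal sketch; the variable name `expected` is only illustrative and does not come from the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "c", "c"],
    "col2": [100, 200, 150, 1000, 400, 200],
})

# Desired result: mean of col2 per group, NaN where the group has fewer than 2 rows
expected = pd.DataFrame({"col1": ["a", "b", "c"], "mean": [150.0, np.nan, 300.0]})
print(expected)
```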

Answer 1:

Use groupby.mean + DataFrame.where with Series.value_counts:

df.groupby('col1').mean().where(df['col1'].value_counts().ge(2)).reset_index()

# you can also select only the columns you want
#(df.groupby('col1')[['col2']]
#   .mean()
#   .where(df['col1'].value_counts().ge(2)).reset_index())

Output

  col1   col2
0    a  150.0
1    b    NaN
2    c  300.0

If you really want blanks (this makes the column object dtype rather than float):

df.groupby('col1').mean().where(df['col1'].value_counts().ge(2), '').reset_index()

  col1 col2
0    a  150
1    b     
2    c  300
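
This works because of index alignment: both the group means and `value_counts()` are indexed by the values of `col1`, so `where` matches them label by label regardless of order. A minimal sketch of the intermediate steps, using the question's data and selecting `col2` as a Series (the names `means`, `counts` and `result` are illustrative, not from the answer):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "c", "c"],
    "col2": [100, 200, 150, 1000, 400, 200],
})

means = df.groupby("col1")["col2"].mean()   # indexed by col1: a, b, c
counts = df["col1"].value_counts()          # also indexed by col1 values, sorted by count

# Keep a mean only where its group has at least 2 rows; other groups become NaN
result = means.where(counts.ge(2)).reset_index(name="mean")
print(result)
```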

Answer 2:

Custom `agg` function

import numpy as np

df.groupby('col1').agg(lambda d: np.nan if len(d) == 1 else d.mean())

       col2
col1       
a     150.0
b       NaN
c     300.0

Answer 3:

I'd go with GroupBy and mask:

g = df.groupby('col1')
g.mean().mask(g.size().eq(1))

       col2
col1       
a     150.0
b       NaN
c     300.0
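
Since `eq(1)` only covers groups with exactly one row, the same idea can be written against an explicit threshold. A small sketch, assuming a hypothetical helper `mean_if_at_least` and parameter `min_count` that are not part of the answer:

```python
import pandas as pd

def mean_if_at_least(df, key, value, min_count=2):
    """Per-group mean of `value`, NaN for groups with fewer than `min_count` rows."""
    g = df.groupby(key)[value]
    # size() counts rows per group; mask replaces the means of small groups with NaN
    return g.mean().mask(g.size().lt(min_count))

df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "c", "c"],
    "col2": [100, 200, 150, 1000, 400, 200],
})
print(mean_if_at_least(df, "col1", "col2"))
print(mean_if_at_least(df, "col1", "col2", min_count=3))
```

Note that `size()` counts every row in a group, while `count()` counts only non-null values, so the choice between them depends on how missing values in col2 should be treated.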

Answer 4:

import numpy as np

df.groupby('col1')['col2'].apply(lambda x: x.mean() if x.count() >= 2 else np.nan)


col1
a    150.0
b      NaN
c    300.0

Edit:

%timeit df.groupby('col1')['col2'].apply(lambda x: x.mean() if x.count() >= 2 else np.nan)
2.36 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# piRSquared (Answer 2)
%timeit df.groupby('col1').agg(lambda d: np.nan if len(d) == 1 else d.mean())
5.9 ms ± 30 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# ansev (Answer 1)
%timeit df.groupby('col1').mean().where(df['col1'].value_counts().ge(2)).reset_index()
7.01 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
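
These timings were taken on the six-row sample, where fixed per-call overhead dominates, so the ranking may not hold on larger data. Below is a hedged sketch for re-measuring on a bigger frame. The frame size, number of groups, and function names are arbitrary assumptions, and no particular timings are claimed.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical larger frame: 20_000 rows with group labels drawn from 0..1_999
big = pd.DataFrame({
    "col1": rng.integers(0, 2_000, size=20_000),
    "col2": rng.random(20_000),
})

def with_apply(df):
    # Python-level lambda called once per group
    return df.groupby("col1")["col2"].apply(
        lambda x: x.mean() if x.count() >= 2 else np.nan
    )

def with_where(df):
    # Built-in group mean, then keep only groups with at least 2 rows
    return df.groupby("col1")["col2"].mean().where(df["col1"].value_counts().ge(2))

def with_mask(df):
    # Built-in group mean, then hide groups with fewer than 2 rows
    g = df.groupby("col1")["col2"]
    return g.mean().mask(g.size().lt(2))

for fn in (with_apply, with_where, with_mask):
    seconds = timeit.timeit(lambda: fn(big), number=3) / 3
    print(f"{fn.__name__}: {seconds:.4f} s per call")
```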

Source: https://stackoverflow.com/questions/60325296/pandas-groupby-count-and-then-conditional-mean

Tags: python, pandas, pandas-groupby