Pandas aggregate count distinct

再見小時候 2020-12-04 07:34

Let's say I have a log of user activity and I want to generate a report of the total duration and the number of unique users per day.

import numpy as np
import          


        
3 Answers
  • 2020-12-04 08:12

    Just adding to the answers already given: passing the string "nunique" seems much faster than the other two forms. Tested here on a dataframe of ~21M rows, grouped down to ~2M groups:

    %time _=g.agg({"id": lambda x: x.nunique()})
    CPU times: user 3min 3s, sys: 2.94 s, total: 3min 6s
    Wall time: 3min 20s
    
    %time _=g.agg({"id": pd.Series.nunique})
    CPU times: user 3min 2s, sys: 2.44 s, total: 3min 4s
    Wall time: 3min 18s
    
    %time _=g.agg({"id": "nunique"})
    CPU times: user 14 s, sys: 4.76 s, total: 18.8 s
    Wall time: 24.4 s
    
  • 2020-12-04 08:15

    'nunique' has been an option for .agg() since pandas 0.20.0, so:

    df.groupby('date').agg({'duration': 'sum', 'user_id': 'nunique'})
    
  • 2020-12-04 08:23

    How about either of:

    >>> df
             date  duration user_id
    0  2013-04-01        30    0001
    1  2013-04-01        15    0001
    2  2013-04-01        20    0002
    3  2013-04-02        15    0002
    4  2013-04-02        30    0002
    >>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique})
                duration  user_id
    date                         
    2013-04-01        65        2
    2013-04-02        45        1
    >>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
                duration  user_id
    date                         
    2013-04-01        65        2
    2013-04-02        45        1
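
For anyone who wants to try the answers above end to end, here is a minimal, self-contained sketch. It rebuilds the five-row sample frame shown above and runs the three variants side by side, so you can check that they agree before timing them on your own data. The last call uses named aggregation, which assumes pandas 0.25 or newer; the output column names `total_duration` and `unique_users` are my own placeholders, not from the question.

```python
import pandas as pd

# Sample rows copied from the answer above; user_id is kept as a string
# so the leading zeros survive.
df = pd.DataFrame({
    "date": ["2013-04-01", "2013-04-01", "2013-04-01", "2013-04-02", "2013-04-02"],
    "duration": [30, 15, 20, 15, 30],
    "user_id": ["0001", "0001", "0002", "0002", "0002"],
})

grouped = df.groupby("date")

# Three ways to count distinct users per day; the string form is the one
# reported above as much faster on large inputs.
by_string = grouped.agg({"duration": "sum", "user_id": "nunique"})
by_method = grouped.agg({"duration": "sum", "user_id": pd.Series.nunique})
by_lambda = grouped.agg({"duration": "sum", "user_id": lambda s: s.nunique()})

# Named aggregation (assumes pandas >= 0.25); the output column names
# here are hypothetical and can be anything you like.
report = grouped.agg(
    total_duration=("duration", "sum"),
    unique_users=("user_id", "nunique"),
)
print(report)
```

The relative timings quoted earlier will depend on your pandas version and on how many groups you have, so it is worth measuring on your own data rather than relying on those numbers.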
    