How can I group by month from a date field using Python/pandas?

孤独总比滥情好 2021-02-02 10:57

I have a DataFrame df that looks like this:

| date      | Revenue |
|-----------|---------|
| 6/2/2017  | 100     |
| 5/23/2017 | 200     |
| 5/20/2017 | 300     |


        
5 Answers
  •  误落风尘
    2021-02-02 11:26

    For a DataFrame with many rows, strftime is slow. If the date column already has dtype datetime64[ns] (convert it with pd.to_datetime(), or pass parse_dates when importing a CSV), you can group directly on a datetime property (Method 3). The speedup is substantial: roughly 43x over strftime in the timings below.
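
    As a minimal sketch of the two conversion routes mentioned above, here they are applied to the three rows from the question. The in-memory CSV text and the names csv_text, df_a and df_b are illustrative assumptions, not part of the original post.

```python
import io

import pandas as pd

# Hypothetical CSV holding the three rows from the question.
csv_text = "date,Revenue\n6/2/2017,100\n5/23/2017,200\n5/20/2017,300\n"

# Route 1: read the column as strings, then convert with pd.to_datetime().
df_a = pd.read_csv(io.StringIO(csv_text))
df_a["date"] = pd.to_datetime(df_a["date"], format="%m/%d/%Y")

# Route 2: ask read_csv to parse the column while importing.
df_b = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Either way the column is now a datetime dtype, so the .dt accessor works.
print(df_a.dtypes)
print(df_a["date"].dt.month.tolist())
```

    Passing an explicit format to pd.to_datetime() avoids any day-first versus month-first ambiguity; parse_dates relies on pandas inferring the format.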

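    import numpy as np
    import pandas as pd

    # Daily timestamps from the Unix epoch until now, in a single 'date' column
    T = pd.date_range(pd.Timestamp(0), pd.Timestamp.now()).to_frame(index=False, name='date')
    # Repeat the frame 9 times to get a larger benchmark; reset the index so it stays unique
    T = pd.concat([T] * 9, ignore_index=True)
    # Random integer revenue in [0, 1000) for every row
    T['revenue'] = np.random.randint(1000, size=len(T))

    print(T.shape)  # (159336, 2) when this was written; the row count grows with the run date
    print(T.dtypes)  # date: datetime64[ns], revenue: int32 (int64 on some platforms)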
    

    Method 1: strftime

    %timeit -n 10 -r 7 T.groupby(T['date'].dt.strftime('%B'))['revenue'].sum()
    

    1.47 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    Method 2: Grouper

    %timeit -n 10 -r 7 T.groupby(pd.Grouper(key='date', freq='1M')).sum()
    #NOTE freq='1M' bins by calendar month-end and keeps years separate, so the result is indexed by timestamps; pandas 2.2+ expects 'ME' instead of 'M'
    

    56.9 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
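
    Method 2 and Method 3 are not interchangeable: grouping by calendar month keeps each year separate, while grouping on dt.month pools the same month across years. The sketch below illustrates the difference on made-up data spanning two years. It uses dt.to_period("M") in place of pd.Grouper, a swapped-in technique chosen only because the period alias reads the same across pandas versions.

```python
import pandas as pd

# Hypothetical data: one row from May 2016 plus the three rows from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-01", "2017-05-20", "2017-05-23", "2017-06-02"]),
    "Revenue": [50, 300, 200, 100],
})

# Year-and-month grouping: May 2016 and May 2017 stay in separate groups.
by_year_month = df.groupby(df["date"].dt.to_period("M"))["Revenue"].sum()

# Month-of-year grouping: both Mays are pooled into month 5.
by_month = df.groupby(df["date"].dt.month)["Revenue"].sum()

print(by_year_month)
print(by_month)
```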

    Method 3: datetime properties

    %timeit -n 10 -r 7 T.groupby(T['date'].dt.month)['revenue'].sum()
    #NOTE The result is indexed by integer month (1..12); map the integers to month names manually if needed
    

    34 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
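
    The manual mapping from integer months to names that Method 3 calls for could look like the sketch below, run on the three rows from the question. Using calendar.month_name from the standard library is one possible choice, assumed here for illustration; it returns English names under the default locale.

```python
import calendar

import pandas as pd

# The three rows from the question.
df = pd.DataFrame({
    "date": ["6/2/2017", "5/23/2017", "5/20/2017"],
    "Revenue": [100, 200, 300],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Group on the integer month first (the fast path), then relabel the few result rows.
monthly = df.groupby(df["date"].dt.month)["Revenue"].sum()
monthly.index = monthly.index.map(lambda m: calendar.month_name[m])

print(monthly)
```

    Relabeling after the aggregation touches at most 12 index entries, so it keeps the speed advantage over formatting every row with strftime.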
