Pandas efficient groupby season for every year

后端 未结 2 665
甜味超标
甜味超标 2020-12-12 02:57

I have a multi-year time series an want the bounds between which 95% of my data lie. I want to look at this by season of the year (\'DJF\', \'MAM\', \'JJA\', \'SON\').

<
相关标签:
2条回答
  • 2020-12-12 03:11

    The fastest so far is a combination of creating a low-frequency timeseries with which to do the season lookup and @Garrett's method of using a numpy.array index lookup rather than a dict.

    season_lookup = np.array([
        None,
        'DJF', 'DJF',
        'MAM', 'MAM', 'MAM',
        'JJA', 'JJA', 'JJA',
        'SON', 'SON', 'SON',
        'DJF'])
    SEASON_HALO = pd.datetools.relativedelta(months=4)
    start_with_halo = df.index.min() - SEASON_HALO
    end_with_halo = df.index.max() + SEASON_HALO
    seasonal_idx = pd.DatetimeIndex(start=start_with_halo, end=end_with_halo, freq='QS-DEC')
    seasonal_ts = pd.DataFrame(index=seasonal_idx)
    seasonal_ts[SEAS] = season_lookup[seasonal_ts.index.month]
    seasonal_minutely_ts = seasonal_ts.resample(df.index.freq, fill_method='ffill')
    df_via_resample = df.join(seasonal_minutely_ts)
    gp_up_sample = df_via_resample.groupby(SEAS)
    gp_up_sample.quantile(FRAC_2_TAIL)
    

    with 10 years of minute data, on my machine: this is about:

    • 2% faster than low frequency dict lookup then up-sample
    • 7% faster than the normal frequency np.array lookup
    • >400% improvement on my original method

    YMMV

    0 讨论(0)
  • 2020-12-12 03:18

    In case it helps, I would suggest replacing the following list comprehension and dict lookup that you identified as slow:

    month_to_season_dct = {
        1: 'DJF', 2: 'DJF',
        3: 'MAM', 4: 'MAM', 5: 'MAM',
        6: 'JJA', 7: 'JJA', 8: 'JJA',
        9: 'SON', 10: 'SON', 11: 'SON',
        12: 'DJF'
    }
    grp_ary = [month_to_season_dct.get(t_stamp.month) for t_stamp in df.index]
    

    with the following, which uses a numpy array as a lookup table.

    month_to_season_lu = np.array([
        None,
        'DJF', 'DJF',
        'MAM', 'MAM', 'MAM',
        'JJA', 'JJA', 'JJA',
        'SON', 'SON', 'SON',
        'DJF'
    ])
    grp_ary = month_to_season_lu[df.index.month]
    

    Here's a timeit comparison of the two approaches on ~3 years of minutely data:

    In [16]: timeit [month_to_season_dct.get(t_stamp.month) for t_stamp in df.index]
    1 loops, best of 3: 12.3 s per loop
    
    In [17]: timeit month_to_season_lu[df.index.month]
    1 loops, best of 3: 549 ms per loop
    
    0 讨论(0)
提交回复
热议问题