Pandas efficient groupby season for every year

Backend · Unresolved · 2 answers · 682 views
甜味超标 asked 2020-12-12 02:57

I have a multi-year time series and want the bounds between which 95% of my data lie. I want to look at this by season of the year ('DJF', 'MAM', 'JJA', 'SON').
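
For context, a minimal sketch of the straightforward per-row approach (the column name `value`, the synthetic data, and the month-to-season dict are illustrative assumptions, not part of my real setup):

    import numpy as np
    import pandas as pd

    # Hypothetical minutely series standing in for the real data.
    idx = pd.date_range('2010-01-01', '2019-12-31 23:59', freq='min')
    df = pd.DataFrame({'value': np.random.randn(len(idx))}, index=idx)

    # Map each row's month to a season, then take the 2.5%/97.5% quantiles.
    month_to_season = {12: 'DJF', 1: 'DJF', 2: 'DJF',
                       3: 'MAM', 4: 'MAM', 5: 'MAM',
                       6: 'JJA', 7: 'JJA', 8: 'JJA',
                       9: 'SON', 10: 'SON', 11: 'SON'}
    season = df.index.month.map(month_to_season)
    df.groupby(season)['value'].quantile([0.025, 0.975])

This works but does a per-row lookup, which is what I am trying to speed up.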

2 Answers
  •  一个人的身影
     2020-12-12 03:11

    The fastest approach so far combines two ideas: build a low-frequency time series with which to do the season lookup, and use @Garrett's method of a numpy.array index lookup rather than a dict.

    import numpy as np
    import pandas as pd

    SEAS = 'season'       # assumed name for the season column
    FRAC_2_TAIL = 0.025   # assumed: 2.5% in each tail gives a 95% interval
    # df is the minutely-indexed DataFrame from the question.

    # Month number (1-12) -> season label; index 0 is unused padding.
    season_lookup = np.array([
        None,
        'DJF', 'DJF',
        'MAM', 'MAM', 'MAM',
        'JJA', 'JJA', 'JJA',
        'SON', 'SON', 'SON',
        'DJF'])

    # Pad the range by a season so the quarterly index fully covers df.
    SEASON_HALO = pd.DateOffset(months=4)
    start_with_halo = df.index.min() - SEASON_HALO
    end_with_halo = df.index.max() + SEASON_HALO

    # Low-frequency index (quarter starts anchored on December) for the lookup.
    seasonal_idx = pd.date_range(start=start_with_halo, end=end_with_halo, freq='QS-DEC')
    seasonal_ts = pd.DataFrame(index=seasonal_idx)
    seasonal_ts[SEAS] = season_lookup[seasonal_ts.index.month]

    # Up-sample the quarterly labels to the data's own frequency
    # (requires df.index.freq to be set, e.g. 'min' for minutely data).
    seasonal_minutely_ts = seasonal_ts.resample(df.index.freq).ffill()
    df_via_resample = df.join(seasonal_minutely_ts)
    gp_up_sample = df_via_resample.groupby(SEAS)
    gp_up_sample.quantile(FRAC_2_TAIL)
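
    To get both ends of the 95% interval in one call, quantile also accepts a list (a usage sketch building on the variables above):

    bounds = gp_up_sample.quantile([FRAC_2_TAIL, 1 - FRAC_2_TAIL])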
    

    With 10 years of minutely data, on my machine, this is approximately:

    • 2% faster than the low-frequency dict lookup followed by up-sampling
    • 7% faster than the normal-frequency np.array lookup (sketched below)
    • a >400% improvement over my original method
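
    For comparison, the "normal-frequency np.array lookup" baseline above amounts to labelling every minutely row directly (my reading of that method, reusing season_lookup, SEAS, and FRAC_2_TAIL from above):

    df[SEAS] = season_lookup[df.index.month]
    df.groupby(SEAS).quantile(FRAC_2_TAIL)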

    YMMV
