Resample in a rolling window using pandas

喜你入骨 提交于 2019-12-05 05:40:42

I had a similar issue dealing with a timedelta series where I wanted to take a moving average and then resample. Here is an example where I have 100 seconds of data. I take a rolling average of 10 second windows and then resample for every 5 seconds, taking the first entry in each resample bin. The result should be the previous 10 second average at 5 second increments. You could do something similar with month format instead of seconds:

df = pd.DataFrame(range(0,100), index=pd.TimedeltaIndex(range(0,100),'s'))
df.rolling('10s').mean().resample('5s').first()

Result:

             0
00:00:00   0.0
00:00:05   2.5
00:00:10   5.5
00:00:15  10.5
00:00:20  15.5
00:00:25  20.5
00:00:30  25.5
00:00:35  30.5
00:00:40  35.5
00:00:45  40.5
00:00:50  45.5
00:00:55  50.5
00:01:00  55.5
00:01:05  60.5
00:01:10  65.5
00:01:15  70.5
00:01:20  75.5
00:01:25  80.5
00:01:30  85.5
00:01:35  90.5

Here's an attempt - not super clean, but it might work.

Dummy data:

df = pd.DataFrame(data={'a': 1.}, 
                  index=pd.date_range(start='2001-1-1', periods=1000))

First define a function to decrease a date n number of months. This needs to be cleaned up, but works for n<=12.

from datetime import datetime    
def decrease_month(date, n):
    assert(n <= 12)

    new_month = date.month - n
    year_offset = 0
    if new_month <= 0:
        year_offset = -1
        new_month = 12 + new_month

    return datetime(date.year + year_offset, new_month, 1)

Then, add 5 new columns for the 5 rolling periods that each date will cross.

for n in range(rolling_period):
    df['m_' + str(n)] = df.index.map(lambda x: decrease_month(x, n))

Then - use the melt function to convert the data from wide to long, so each rolling period will have one entry.

df_m = pd.melt(df, id_vars='a')

You should be able to groupby the newly created column, and each date will represent the right 5 month rolling period.

In [222]: df_m.groupby('value').sum()
Out[222]: 
              a
value          
2000-09-01   31
2000-10-01   59
2000-11-01   90
2000-12-01  120
2001-01-01  151
2001-02-01  150
2001-03-01  153
2001-04-01  153
2001-05-01  153
2001-06-01  153
2001-07-01  153
...

I have solved a similar problem with the following code:

interval = 5
frames = []
for base in range(interval):
  frame = data.resample(f"{interval}min", base=base).last()
  frames.append(frame)

pd.concat(frames, axis=0).sort_index()

Here I create 5 data frames which are resampled at the same interval, but have different offsets (the base parameter). Then I just have to concatenate and sort them. Should usually be much more efficient than rolling + resampling (the only overhead is the sorting).

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!