Pandas: rolling mean by time interval


Question:

I'm new to pandas. I've got a bunch of polling data; I want to compute a rolling mean to get an estimate for each day based on a three-day window. As I understand from this question, the rolling_* functions compute the window based on a specified number of values, not on a specific datetime range.

Is there a different function that implements this functionality? Or am I stuck writing my own?

EDIT:

Sample input data:

polls_subset.tail(20)
Out[185]:
            favorable  unfavorable  other
enddate
2012-10-25       0.48         0.49   0.03
2012-10-25       0.51         0.48   0.02
2012-10-27       0.51         0.47   0.02
2012-10-26       0.56         0.40   0.04
2012-10-28       0.48         0.49   0.04
2012-10-28       0.46         0.46   0.09
2012-10-28       0.48         0.49   0.03
2012-10-28       0.49         0.48   0.03
2012-10-30       0.53         0.45   0.02
2012-11-01       0.49         0.49   0.03
2012-11-01       0.47         0.47   0.05
2012-11-01       0.51         0.45   0.04
2012-11-03       0.49         0.45   0.06
2012-11-04       0.53         0.39   0.00
2012-11-04       0.47         0.44   0.08
2012-11-04       0.49         0.48   0.03
2012-11-04       0.52         0.46   0.01
2012-11-04       0.50         0.47   0.03
2012-11-05       0.51         0.46   0.02
2012-11-07       0.51         0.41   0.00

Output would have only one row for each date.

EDIT x2: fixed typo

Answer 1:

What about something like this:

First, resample the data frame into 1D intervals. This takes the mean of the values for all duplicate days. Use the fill_method option to fill in missing date values. Then pass the resampled frame into pd.rolling_mean with a window of 3 and min_periods=1:

pd.rolling_mean(df.resample("1D", fill_method="ffill"), window=3, min_periods=1)

            favorable  unfavorable     other
enddate
2012-10-25   0.495000     0.485000  0.025000
2012-10-26   0.527500     0.442500  0.032500
2012-10-27   0.521667     0.451667  0.028333
2012-10-28   0.515833     0.450000  0.035833
2012-10-29   0.488333     0.476667  0.038333
2012-10-30   0.495000     0.470000  0.038333
2012-10-31   0.512500     0.460000  0.029167
2012-11-01   0.516667     0.456667  0.026667
2012-11-02   0.503333     0.463333  0.033333
2012-11-03   0.490000     0.463333  0.046667
2012-11-04   0.494000     0.456000  0.043333
2012-11-05   0.500667     0.452667  0.036667
2012-11-06   0.507333     0.456000  0.023333
2012-11-07   0.510000     0.443333  0.013333

UPDATE: As Ben points out in the comments, with pandas 0.18.0 the syntax has changed. With the new syntax this would be:

df.resample("1d").sum().fillna(0).rolling(window=3, min_periods=1).mean() 


Answer 2:

I just had the same question, but with irregularly spaced data points. Resampling is not really an option here, so I wrote my own function. Maybe it will be useful for others too:

from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np

def rolling_mean(data, window, min_periods=1, center=False):
    ''' Function that computes a rolling mean

    Parameters
    ----------
    data : DataFrame or Series
           If a DataFrame is passed, the rolling_mean is computed for all columns.
    window : int or string
             If int is passed, window is the number of observations used for calculating
             the statistic, as defined by the function pd.rolling_mean()
             If a string is passed, it must be a frequency string, e.g. '90S'. This is
             internally converted into a DateOffset object, representing the window size.
    min_periods : int
                  Minimum number of observations in window required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column
    '''
    def f(x):
        '''Function to apply that actually computes the rolling mean'''
        if center == False:
            dslice = col[x-pd.datetools.to_offset(window).delta+timedelta(0,0,1):x]
                # adding a microsecond because when slicing with labels start and endpoint
                # are inclusive
        else:
            dslice = col[x-pd.datetools.to_offset(window).delta/2+timedelta(0,0,1):
                         x+pd.datetools.to_offset(window).delta/2]
        if dslice.size < min_periods:
            return np.nan
        else:
            return dslice.mean()

    data = DataFrame(data.copy())
    dfout = DataFrame()
    if isinstance(window, int):
        # integer windows fall back to the ordinary count-based rolling mean
        dfout = pd.rolling_mean(data, window, min_periods=min_periods, center=center)
    elif isinstance(window, basestring):
        # time-based windows: apply f to every timestamp of every column
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iteritems():
            result = idx.apply(f)
            result.name = colname
            dfout = dfout.join(result, how='outer')
    if dfout.columns.size == 1:
        dfout = dfout.ix[:, 0]
    return dfout
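A minimal usage sketch (the series below is made up for illustration, and the function relies on the older pandas API where pd.rolling_mean and pd.datetools still exist):

# hypothetical irregularly spaced series
idx = pd.to_datetime(['2013-01-01 09:00:00', '2013-01-01 09:00:02',
                      '2013-01-01 09:00:03', '2013-01-01 09:00:07'])
ser = Series([0.0, 1.0, 2.0, 3.0], index=idx)

# trailing 5-second window, at least one observation required per window
rolling_mean(ser, window='5S', min_periods=1)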


Answer 3:

In the meantime, a time-window capability was added. See the link below:

https://github.com/pydata/pandas/pull/13513

In [1]: df = DataFrame({'B': range(5)})  In [2]: df.index = [Timestamp('20130101 09:00:00'),    ...:             Timestamp('20130101 09:00:02'),    ...:             Timestamp('20130101 09:00:03'),    ...:             Timestamp('20130101 09:00:05'),    ...:             Timestamp('20130101 09:00:06')]  In [3]: df Out[3]:                       B 2013-01-01 09:00:00  0 2013-01-01 09:00:02  1 2013-01-01 09:00:03  2 2013-01-01 09:00:05  3 2013-01-01 09:00:06  4  In [4]: df.rolling(2, min_periods=1).sum() Out[4]:                         B 2013-01-01 09:00:00  0.0 2013-01-01 09:00:02  1.0 2013-01-01 09:00:03  3.0 2013-01-01 09:00:05  5.0 2013-01-01 09:00:06  7.0  In [5]: df.rolling('2s', min_periods=1).sum() Out[5]:                         B 2013-01-01 09:00:00  0.0 2013-01-01 09:00:02  1.0 2013-01-01 09:00:03  3.0 2013-01-01 09:00:05  3.0 2013-01-01 09:00:06  7.0 
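Applied to the polling data in the question, a sketch using this time-based window (assuming df is the question's frame indexed by enddate) could be:

daily = df.resample('1D').mean()            # one row per calendar day; empty days become NaN
daily.rolling('3D', min_periods=1).mean()   # trailing 3-day window; NaN days are skipped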


Answer 4:

user2689410's code was exactly what I needed. Here is my version (credit to user2689410), which is faster because it computes the mean for whole rows of the DataFrame at once.

I hope my suffix conventions are readable: _s: string, _i: int, _b: bool, _ser: Series and _df: DataFrame. Where you find multiple suffixes, the type can be either.

import pandas as pd
from datetime import datetime, timedelta
import numpy as np

def time_offset_rolling_mean_df_ser(data_df_ser, window_i_s, min_periods_i=1, center_b=False):
    """ Function that computes a rolling mean

    Credit goes to user2689410 at http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval

    Parameters
    ----------
    data_df_ser : DataFrame or Series
         If a DataFrame is passed, the time_offset_rolling_mean_df_ser is computed for all columns.
    window_i_s : int or string
         If int is passed, window_i_s is the number of observations used for calculating
         the statistic, as defined by the function pd.time_offset_rolling_mean_df_ser()
         If a string is passed, it must be a frequency string, e.g. '90S'. This is
         internally converted into a DateOffset object, representing the window_i_s size.
    min_periods_i : int
         Minimum number of observations in window_i_s required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column

    >>> idx = [
    ...     datetime(2011, 2, 7, 0, 0),
    ...     datetime(2011, 2, 7, 0, 1),
    ...     datetime(2011, 2, 7, 0, 1, 30),
    ...     datetime(2011, 2, 7, 0, 2),
    ...     datetime(2011, 2, 7, 0, 4),
    ...     datetime(2011, 2, 7, 0, 5),
    ...     datetime(2011, 2, 7, 0, 5, 10),
    ...     datetime(2011, 2, 7, 0, 6),
    ...     datetime(2011, 2, 7, 0, 8),
    ...     datetime(2011, 2, 7, 0, 9)]
    >>> idx = pd.Index(idx)
    >>> vals = np.arange(len(idx)).astype(float)
    >>> ser = pd.Series(vals, index=idx)
    >>> df = pd.DataFrame({'s1':ser, 's2':ser+1})
    >>> time_offset_rolling_mean_df_ser(df, window_i_s='2min')
                          s1   s2
    2011-02-07 00:00:00  0.0  1.0
    2011-02-07 00:01:00  0.5  1.5
    2011-02-07 00:01:30  1.0  2.0
    2011-02-07 00:02:00  2.0  3.0
    2011-02-07 00:04:00  4.0  5.0
    2011-02-07 00:05:00  4.5  5.5
    2011-02-07 00:05:10  5.0  6.0
    2011-02-07 00:06:00  6.0  7.0
    2011-02-07 00:08:00  8.0  9.0
    2011-02-07 00:09:00  8.5  9.5
    """

    def calculate_mean_at_ts(ts):
        """Function (closure) to apply that actually computes the rolling mean"""
        if center_b == False:
            dslice_df_ser = data_df_ser[
                ts-pd.datetools.to_offset(window_i_s).delta+timedelta(0,0,1):
                ts
            ]
            # adding a microsecond because when slicing with labels start and endpoint
            # are inclusive
        else:
            dslice_df_ser = data_df_ser[
                ts-pd.datetools.to_offset(window_i_s).delta/2+timedelta(0,0,1):
                ts+pd.datetools.to_offset(window_i_s).delta/2
            ]
        if (isinstance(dslice_df_ser, pd.DataFrame) and dslice_df_ser.shape[0] < min_periods_i) or \
           (isinstance(dslice_df_ser, pd.Series) and dslice_df_ser.size < min_periods_i):
            return dslice_df_ser.mean()*np.nan   # keeps number format and whether Series or DataFrame
        else:
            return dslice_df_ser.mean()

    if isinstance(window_i_s, int):
        # integer windows fall back to the ordinary count-based rolling mean
        mean_df_ser = pd.rolling_mean(data_df_ser, window=window_i_s,
                                      min_periods=min_periods_i, center=center_b)
    elif isinstance(window_i_s, basestring):
        # time-based windows: apply the closure once per timestamp, for all columns at once
        idx_ser = pd.Series(data_df_ser.index.to_pydatetime(), index=data_df_ser.index)
        mean_df_ser = idx_ser.apply(calculate_mean_at_ts)

    return mean_df_ser


Answer 5:

This example seems to call for a weighted mean as suggested in @andyhayden's comment. For example, there are two polls on 10/25 and one each on 10/26 and 10/27. If you just resample and then take the mean, this effectively gives twice as much weighting to the polls on 10/26 and 10/27 compared to the ones on 10/25.

To give equal weight to each poll rather than equal weight to each day, you could do something like the following.

>>> wt = df.resample('D',limit=5).count()

            favorable  unfavorable  other
enddate
2012-10-25          2            2      2
2012-10-26          1            1      1
2012-10-27          1            1      1

>>> df2 = df.resample('D').mean()

            favorable  unfavorable  other
enddate
2012-10-25      0.495        0.485  0.025
2012-10-26      0.560        0.400  0.040
2012-10-27      0.510        0.470  0.020

That gives you the raw ingredients for doing a poll-based mean instead of a day-based mean. As before, the polls are averaged on 10/25, but the weight for 10/25 is also stored and is double the weight on 10/26 or 10/27 to reflect that two polls were taken on 10/25.

>>> df3 = df2 * wt
>>> df3 = df3.rolling(3,min_periods=1).sum()
>>> wt3 = wt.rolling(3,min_periods=1).sum()
>>> df3 = df3 / wt3

             favorable  unfavorable     other
enddate
2012-10-25   0.495000     0.485000  0.025000
2012-10-26   0.516667     0.456667  0.030000
2012-10-27   0.515000     0.460000  0.027500
2012-10-28   0.496667     0.465000  0.041667
2012-10-29   0.484000     0.478000  0.042000
2012-10-30   0.488000     0.474000  0.042000
2012-10-31   0.530000     0.450000  0.020000
2012-11-01   0.500000     0.465000  0.035000
2012-11-02   0.490000     0.470000  0.040000
2012-11-03   0.490000     0.465000  0.045000
2012-11-04   0.500000     0.448333  0.035000
2012-11-05   0.501429     0.450000  0.032857
2012-11-06   0.503333     0.450000  0.028333
2012-11-07   0.510000     0.435000  0.010000

Note that the rolling mean for 10/27 is now 0.515000 (poll-weighted) rather than 0.521667 (day-weighted).

Also note that there have been changes to the APIs for resample and rolling as of version 0.18.0.

rolling (what's new in pandas 0.18.0)

resample (what's new in pandas 0.18.0)
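With the post-0.18 API, the same poll-weighted result can also be sketched more compactly by rolling the daily sums and counts separately and dividing (again assuming df is the raw polling frame from the question):

daily_sum = df.resample('D').sum()      # total of all polls taken each day
daily_cnt = df.resample('D').count()    # number of polls taken each day
poll_weighted = (daily_sum.rolling(3, min_periods=1).sum()
                 / daily_cnt.rolling(3, min_periods=1).sum())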



Answer 6:

I found that user2689410's code broke when I tried it with window='1M', because taking the delta of a month offset threw this error:

AttributeError: 'MonthEnd' object has no attribute 'delta' 
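For illustration, only fixed-length (tick) offsets expose a .delta; calendar offsets such as a month end do not, which is what the error above is complaining about:

from pandas.tseries.offsets import Second, MonthEnd

Second(90).delta     # Timedelta('0 days 00:01:30'): fixed-length offsets have a delta
MonthEnd(1).delta    # raises AttributeError: 'MonthEnd' object has no attribute 'delta'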

I added the option to pass a relative time delta directly, so you can do similar things for user-defined periods.

Thanks for the pointers, here's my attempt - hope it's of use.

from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
from dateutil.relativedelta import relativedelta

def rolling_mean(data, window, min_periods=1, center=False):
    """ Function that computes a rolling mean

    Reference:
        http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval

    Parameters
    ----------
    data : DataFrame or Series
           If a DataFrame is passed, the rolling_mean is computed for all columns.
    window : int, string, Timedelta or Relativedelta
             int - number of observations used for calculating the statistic,
                   as defined by the function pd.rolling_mean()
             string - must be a frequency string, e.g. '90S'. This is
                      internally converted into a DateOffset object, and then a
                      Timedelta representing the window size.
             Timedelta / Relativedelta - can directly pass a timedelta.
    min_periods : int
                  Minimum number of observations in window required to have a value.
    center : bool
             Point around which to 'center' the slicing.

    Returns
    -------
    Series or DataFrame, if more than one column
    """
    def f(x, time_increment):
        """Function to apply that actually computes the rolling mean
        :param x:
        :return:
        """
        if not center:
            # adding a microsecond because when slicing with labels start
            # and endpoint are inclusive
            start_date = x - time_increment + timedelta(0, 0, 1)
            end_date = x
        else:
            start_date = x - time_increment/2 + timedelta(0, 0, 1)
            end_date = x + time_increment/2
        # Select the slice of the column that falls inside the window
        dslice = col[start_date:end_date]

        if dslice.size < min_periods:
            return np.nan
        else:
            return dslice.mean()

    data = DataFrame(data.copy())
    dfout = DataFrame()
    if isinstance(window, int):
        # integer windows fall back to the ordinary count-based rolling mean
        dfout = pd.rolling_mean(data, window, min_periods=min_periods, center=center)
    elif isinstance(window, basestring):
        # frequency strings are converted to a fixed Timedelta
        time_delta = pd.datetools.to_offset(window).delta
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iteritems():
            result = idx.apply(lambda x: f(x, time_delta))
            result.name = colname
            dfout = dfout.join(result, how='outer')
    elif isinstance(window, (timedelta, relativedelta)):
        # time deltas can be passed directly, e.g. relativedelta(days=3)
        time_delta = window
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iteritems():
            result = idx.apply(lambda x: f(x, time_delta))
            result.name = colname
            dfout = dfout.join(result, how='outer')
    if dfout.columns.size == 1:
        dfout = dfout.ix[:, 0]
    return dfout

And an example with a 3-day time window to calculate the mean:

from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
from dateutil.relativedelta import relativedelta

idx = [datetime(2011, 2, 7, 0, 0),
       datetime(2011, 2, 7, 0, 1),
       datetime(2011, 2, 8, 0, 1, 30),
       datetime(2011, 2, 9, 0, 2),
       datetime(2011, 2, 10, 0, 4),
       datetime(2011, 2, 11, 0, 5),
       datetime(2011, 2, 12, 0, 5, 10),
       datetime(2011, 2, 12, 0, 6),
       datetime(2011, 2, 13, 0, 8),
       datetime(2011, 2, 14, 0, 9)]
idx = pd.Index(idx)
vals = np.arange(len(idx)).astype(float)
s = Series(vals, index=idx)

# Now try by passing the 3 days as a relative time delta directly.
rm = rolling_mean(s, window=relativedelta(days=3))

>>> rm
Out[2]:
2011-02-07 00:00:00    0.0
2011-02-07 00:01:00    0.5
2011-02-08 00:01:30    1.0
2011-02-09 00:02:00    1.5
2011-02-10 00:04:00    3.0
2011-02-11 00:05:00    4.0
2011-02-12 00:05:10    5.0
2011-02-12 00:06:00    5.5
2011-02-13 00:08:00    6.5
2011-02-14 00:09:00    7.5
Name: 0, dtype: float64


Answer 7:

To keep it basic, I used a loop and something like this to get you started (my index values are datetimes):

import pandas as pd
import datetime as dt

# populate your dataframe: "df"
# ...

df[df.index < (df.index[0] + dt.timedelta(hours=1))]   # slice of everything within one hour of the first timestamp

and then you can run functions on that slice. You can see how adding an iterator to make the start of the window something other than the first value in your dataframe's index would then roll the window (you could use a > rule for the start as well, for example).
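A rough sketch of that loop (the names are illustrative; it assumes df has a DatetimeIndex and produces one estimate per day from a trailing 3-day window):

import datetime as dt
import pandas as pd

window = dt.timedelta(days=3)
estimates = {}
for day in df.resample('D').mean().index:              # one window end per calendar day
    in_window = (df.index > day - window) & (df.index <= day)
    estimates[day] = df.loc[in_window].mean()          # mean of all rows inside the window
rolled = pd.DataFrame(estimates).T                     # rows: days, columns: original columns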

Note that this may be less efficient for very large data or very small increments, since the slicing becomes more expensive (it works well enough for me on hundreds of thousands of rows with several columns, using hourly windows across a few weeks).


