Question:
I'm new to pandas. I've got a bunch of polling data; I want to compute a rolling mean to get an estimate for each day based on a three-day window. As I understand from this question, the rolling_* functions compute the window based on a specified number of values, not on a specific datetime range.
Is there a different function that implements this functionality? Or am I stuck writing my own?
EDIT:
Sample input data:
polls_subset.tail(20)
Out[185]:
            favorable  unfavorable  other
enddate
2012-10-25       0.48         0.49   0.03
2012-10-25       0.51         0.48   0.02
2012-10-27       0.51         0.47   0.02
2012-10-26       0.56         0.40   0.04
2012-10-28       0.48         0.49   0.04
2012-10-28       0.46         0.46   0.09
2012-10-28       0.48         0.49   0.03
2012-10-28       0.49         0.48   0.03
2012-10-30       0.53         0.45   0.02
2012-11-01       0.49         0.49   0.03
2012-11-01       0.47         0.47   0.05
2012-11-01       0.51         0.45   0.04
2012-11-03       0.49         0.45   0.06
2012-11-04       0.53         0.39   0.00
2012-11-04       0.47         0.44   0.08
2012-11-04       0.49         0.48   0.03
2012-11-04       0.52         0.46   0.01
2012-11-04       0.50         0.47   0.03
2012-11-05       0.51         0.46   0.02
2012-11-07       0.51         0.41   0.00
Output would have only one row for each date.
EDIT x2: fixed typo
Answer 1:
What about something like this:
First, resample the data frame into 1D intervals. This takes the mean of the values for all duplicate days. Use the fill_method option to fill in missing date values. Next, pass the resampled frame into pd.rolling_mean with a window of 3 and min_periods=1:
pd.rolling_mean(df.resample("1D", fill_method="ffill"), window=3, min_periods=1)

            favorable  unfavorable     other
enddate
2012-10-25   0.495000     0.485000  0.025000
2012-10-26   0.527500     0.442500  0.032500
2012-10-27   0.521667     0.451667  0.028333
2012-10-28   0.515833     0.450000  0.035833
2012-10-29   0.488333     0.476667  0.038333
2012-10-30   0.495000     0.470000  0.038333
2012-10-31   0.512500     0.460000  0.029167
2012-11-01   0.516667     0.456667  0.026667
2012-11-02   0.503333     0.463333  0.033333
2012-11-03   0.490000     0.463333  0.046667
2012-11-04   0.494000     0.456000  0.043333
2012-11-05   0.500667     0.452667  0.036667
2012-11-06   0.507333     0.456000  0.023333
2012-11-07   0.510000     0.443333  0.013333
UPDATE: As Ben points out in the comments, with pandas 0.18.0 the syntax has changed. With the new syntax this would be:
df.resample("1d").sum().fillna(0).rolling(window=3, min_periods=1).mean()
Answer 2:
I just had the same question but with irregularly spaced datapoints. Resample is not really an option here. So I created my own function. Maybe it will be useful for others too:
from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np

def rolling_mean(data, window, min_periods=1, center=False):
    ''' Function that computes a rolling mean

    Parameters
    ----------
    data : DataFrame or Series
           If a DataFrame is passed, the rolling_mean is computed for all columns.
    window : int or string
             If int is passed, window is the number of observations used for calculating
             the statistic, as defined by the function pd.rolling_mean()
             If a string is passed, it must be a frequency string, e.g. '90S'. This is
             internally converted into a DateOffset object, representing the window size.
    min_periods : int
                  Minimum number of observations in window required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column
    '''
    def f(x):
        '''Function to apply that actually computes the rolling mean'''
        if center == False:
            # adding a microsecond because when slicing with labels start
            # and endpoint are inclusive
            dslice = col[x - pd.datetools.to_offset(window).delta + timedelta(0, 0, 1):x]
        else:
            dslice = col[x - pd.datetools.to_offset(window).delta/2 + timedelta(0, 0, 1):
                         x + pd.datetools.to_offset(window).delta/2]
        # (tail of the function was cut off in the source; completed per the docstring)
        if dslice.size < min_periods:
            return np.nan
        else:
            return dslice.mean()

    data = DataFrame(data.copy())
    dfout = DataFrame()
    if isinstance(window, int):
        dfout = pd.rolling_mean(data, window, min_periods=min_periods, center=center)
    elif isinstance(window, basestring):
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iteritems():
            result = idx.apply(f)
            result.name = colname
            dfout = dfout.join(result, how='outer')
    if dfout.columns.size == 1:
        dfout = dfout.ix[:, 0]
    return dfout
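A quick usage sketch with hypothetical data (the function relies on the old pd.rolling_mean/pd.datetools API, so it needs a pre-0.18 pandas):

    # Hypothetical example: irregularly spaced samples, 90-second window.
    idx = pd.to_datetime(['2013-01-01 00:00:00', '2013-01-01 00:00:40',
                          '2013-01-01 00:02:00', '2013-01-01 00:02:10'])
    ser = Series([1.0, 2.0, 3.0, 4.0], index=idx)
    rolling_mean(ser, window='90S')  # mean over the trailing 90 seconds at each timestamp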
Answer 3:
In the meantime, a time-window capability was added. See the link below:
https://github.com/pydata/pandas/pull/13513
In [1]: df = DataFrame({'B': range(5)})

In [2]: df.index = [Timestamp('20130101 09:00:00'),
   ...:             Timestamp('20130101 09:00:02'),
   ...:             Timestamp('20130101 09:00:03'),
   ...:             Timestamp('20130101 09:00:05'),
   ...:             Timestamp('20130101 09:00:06')]

In [3]: df
Out[3]:
                     B
2013-01-01 09:00:00  0
2013-01-01 09:00:02  1
2013-01-01 09:00:03  2
2013-01-01 09:00:05  3
2013-01-01 09:00:06  4

In [4]: df.rolling(2, min_periods=1).sum()
Out[4]:
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  5.0
2013-01-01 09:00:06  7.0

In [5]: df.rolling('2s', min_periods=1).sum()
Out[5]:
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  3.0
2013-01-01 09:00:06  7.0
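With that capability, the question's three-day poll average becomes a one-liner. A minimal sketch, assuming df holds the polling data with its enddate DatetimeIndex sorted ascending:

    # Time-based window: average duplicate days first, then roll over a
    # 3-day span. Requires a monotonic DatetimeIndex.
    df.resample('1D').mean().rolling('3D', min_periods=1).mean()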
Answer 4:
user2689410's code was exactly what I needed. Here is my version (credits to user2689410), which is faster because it computes the mean for whole rows of the DataFrame at once.
I hope my suffix conventions are readable: _s: string, _i: int, _b: bool, _ser: Series and _df: DataFrame. Where you find multiple suffixes, the type can be either.
import pandas as pd
from datetime import datetime, timedelta
import numpy as np

def time_offset_rolling_mean_df_ser(data_df_ser, window_i_s, min_periods_i=1, center_b=False):
    """ Function that computes a rolling mean

    Credit goes to user2689410 at
    http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval

    Parameters
    ----------
    data_df_ser : DataFrame or Series
         If a DataFrame is passed, the time_offset_rolling_mean_df_ser is computed
         for all columns.
    window_i_s : int or string
         If int is passed, window_i_s is the number of observations used for
         calculating the statistic, as defined by the function pd.rolling_mean()
         If a string is passed, it must be a frequency string, e.g. '90S'. This is
         internally converted into a DateOffset object, representing the window_i_s size.
    min_periods_i : int
         Minimum number of observations in window_i_s required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column

    >>> idx = [
    ...     datetime(2011, 2, 7, 0, 0),
    ...     datetime(2011, 2, 7, 0, 1),
    ...     datetime(2011, 2, 7, 0, 1, 30),
    ...     datetime(2011, 2, 7, 0, 2),
    ...     datetime(2011, 2, 7, 0, 4),
    ...     datetime(2011, 2, 7, 0, 5),
    ...     datetime(2011, 2, 7, 0, 5, 10),
    ...     datetime(2011, 2, 7, 0, 6),
    ...     datetime(2011, 2, 7, 0, 8),
    ...     datetime(2011, 2, 7, 0, 9)]
    >>> idx = pd.Index(idx)
    >>> vals = np.arange(len(idx)).astype(float)
    >>> ser = pd.Series(vals, index=idx)
    >>> df = pd.DataFrame({'s1':ser, 's2':ser+1})
    >>> time_offset_rolling_mean_df_ser(df, window_i_s='2min')
                          s1   s2
    2011-02-07 00:00:00  0.0  1.0
    2011-02-07 00:01:00  0.5  1.5
    2011-02-07 00:01:30  1.0  2.0
    2011-02-07 00:02:00  2.0  3.0
    2011-02-07 00:04:00  4.0  5.0
    2011-02-07 00:05:00  4.5  5.5
    2011-02-07 00:05:10  5.0  6.0
    2011-02-07 00:06:00  6.0  7.0
    2011-02-07 00:08:00  8.0  9.0
    2011-02-07 00:09:00  8.5  9.5
    """
    def calculate_mean_at_ts(ts):
        """Function (closure) to apply that actually computes the rolling mean"""
        if center_b == False:
            # adding a microsecond because when slicing with labels start
            # and endpoint are inclusive
            dslice_df_ser = data_df_ser[
                ts - pd.datetools.to_offset(window_i_s).delta + timedelta(0, 0, 1):
                ts
            ]
        else:
            dslice_df_ser = data_df_ser[
                ts - pd.datetools.to_offset(window_i_s).delta/2 + timedelta(0, 0, 1):
                ts + pd.datetools.to_offset(window_i_s).delta/2
            ]
        # (tail of the function was cut off in the source; completed per the docstring)
        if (isinstance(dslice_df_ser, pd.DataFrame) and dslice_df_ser.shape[0] < min_periods_i) or \
           (isinstance(dslice_df_ser, pd.Series) and dslice_df_ser.size < min_periods_i):
            # multiplying by NaN keeps the number format and the Series/DataFrame shape
            return dslice_df_ser.mean() * np.nan
        else:
            return dslice_df_ser.mean()

    if isinstance(window_i_s, int):
        mean_df_ser = pd.rolling_mean(data_df_ser, window=window_i_s,
                                      min_periods=min_periods_i, center=center_b)
    elif isinstance(window_i_s, basestring):
        idx_ser = pd.Series(data_df_ser.index.to_pydatetime(), index=data_df_ser.index)
        mean_df_ser = idx_ser.apply(calculate_mean_at_ts)

    return mean_df_ser
Answer 5:
This example seems to call for a weighted mean as suggested in @andyhayden's comment. For example, there are two polls on 10/25 and one each on 10/26 and 10/27. If you just resample and then take the mean, this effectively gives twice as much weighting to the polls on 10/26 and 10/27 compared to the ones on 10/25.
To give equal weight to each poll rather than equal weight to each day, you could do something like the following.
>>> wt = df.resample('D',limit=5).count()

            favorable  unfavorable  other
enddate
2012-10-25          2            2      2
2012-10-26          1            1      1
2012-10-27          1            1      1

>>> df2 = df.resample('D').mean()

            favorable  unfavorable  other
enddate
2012-10-25      0.495        0.485  0.025
2012-10-26      0.560        0.400  0.040
2012-10-27      0.510        0.470  0.020
That gives you the raw ingredients for doing a poll-based mean instead of a day-based mean. As before, the polls are averaged on 10/25, but the weight for 10/25 is also stored and is double the weight on 10/26 or 10/27 to reflect that two polls were taken on 10/25.
>>> df3 = df2 * wt
>>> df3 = df3.rolling(3,min_periods=1).sum()
>>> wt3 = wt.rolling(3,min_periods=1).sum()
>>> df3 = df3 / wt3

            favorable  unfavorable     other
enddate
2012-10-25   0.495000     0.485000  0.025000
2012-10-26   0.516667     0.456667  0.030000
2012-10-27   0.515000     0.460000  0.027500
2012-10-28   0.496667     0.465000  0.041667
2012-10-29   0.484000     0.478000  0.042000
2012-10-30   0.488000     0.474000  0.042000
2012-10-31   0.530000     0.450000  0.020000
2012-11-01   0.500000     0.465000  0.035000
2012-11-02   0.490000     0.470000  0.040000
2012-11-03   0.490000     0.465000  0.045000
2012-11-04   0.500000     0.448333  0.035000
2012-11-05   0.501429     0.450000  0.032857
2012-11-06   0.503333     0.450000  0.028333
2012-11-07   0.510000     0.435000  0.010000
Note that the rolling mean for 10/27 is now 0.515000 (poll-weighted) rather than 0.521667 (day-weighted).
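Wrapped up, the whole weighted computation fits in a small helper. A sketch; the name weighted_rolling_mean is my own (hypothetical) choice:

    def weighted_rolling_mean(df, window=3):
        # Hypothetical helper wrapping the steps above: daily counts act as
        # weights so every poll, not every day, counts equally.
        wt = df.resample('D').count()       # polls per day
        daily = df.resample('D').mean()     # daily averages
        num = (daily * wt).rolling(window, min_periods=1).sum()  # weighted sums
        den = wt.rolling(window, min_periods=1).sum()            # total weight
        return num / den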
Also note that there have been changes to the APIs for resample and rolling as of version 0.18.0:
rolling (what's new in pandas 0.18.0)
resample (what's new in pandas 0.18.0)
Answer 6:
I found that user2689410's code broke when I tried window='1M', because taking the delta of a month offset threw this error:
AttributeError: 'MonthEnd' object has no attribute 'delta'
I added the option to pass a relative time delta directly, so you can do similar things for user-defined periods.
Thanks for the pointers; here's my attempt - hope it's of use.
def rolling_mean(data, window, min_periods=1, center=False):
    """ Function that computes a rolling mean

    Reference:
    http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval

    Parameters
    ----------
    data : DataFrame or Series
           If a DataFrame is passed, the rolling_mean is computed for all columns.
    window : int, string, Timedelta or Relativedelta
             int - number of observations used for calculating the statistic,
                   as defined by the function pd.rolling_mean()
             string - must be a frequency string, e.g. '90S'. This is internally
                   converted into a DateOffset object, and then a Timedelta
                   representing the window size.
             Timedelta / Relativedelta - can be passed directly as the window.
    min_periods : int
                  Minimum number of observations in window required to have a value.
    center : bool
             Point around which to 'center' the slicing.

    Returns
    -------
    Series or DataFrame, if more than one column
    """
    def f(x, time_increment):
        """Function to apply that actually computes the rolling mean"""
        if not center:
            # adding a microsecond because when slicing with labels start
            # and endpoint are inclusive
            start_date = x - time_increment + timedelta(0, 0, 1)
            end_date = x
        else:
            start_date = x - time_increment/2 + timedelta(0, 0, 1)
            end_date = x + time_increment/2
        # Select the dates within the window from the column
        dslice = col[start_date:end_date]
        # (tail of the function was cut off in the source; completed per the docstring)
        if dslice.size < min_periods:
            return np.nan
        else:
            return dslice.mean()

    data = DataFrame(data.copy())
    dfout = DataFrame()
    if isinstance(window, int):
        dfout = pd.rolling_mean(data, window, min_periods=min_periods, center=center)
    else:
        if isinstance(window, basestring):
            time_delta = pd.datetools.to_offset(window).delta
        else:
            # window is already a timedelta / relativedelta
            time_delta = window
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iteritems():
            result = idx.apply(lambda x: f(x, time_delta))
            result.name = colname
            dfout = dfout.join(result, how='outer')
    if dfout.columns.size == 1:
        dfout = dfout.ix[:, 0]
    return dfout
And an example with a 3-day time window to calculate the mean:
from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np
from dateutil.relativedelta import relativedelta

idx = [datetime(2011, 2, 7, 0, 0),
       datetime(2011, 2, 7, 0, 1),
       datetime(2011, 2, 8, 0, 1, 30),
       datetime(2011, 2, 9, 0, 2),
       datetime(2011, 2, 10, 0, 4),
       datetime(2011, 2, 11, 0, 5),
       datetime(2011, 2, 12, 0, 5, 10),
       datetime(2011, 2, 12, 0, 6),
       datetime(2011, 2, 13, 0, 8),
       datetime(2011, 2, 14, 0, 9)]
idx = pd.Index(idx)
vals = np.arange(len(idx)).astype(float)
s = Series(vals, index=idx)

# Now try by passing the 3 days as a relative time delta directly.
rm = rolling_mean(s, window=relativedelta(days=3))

>>> rm
Out[2]:
2011-02-07 00:00:00    0.0
2011-02-07 00:01:00    0.5
2011-02-08 00:01:30    1.0
2011-02-09 00:02:00    1.5
2011-02-10 00:04:00    3.0
2011-02-11 00:05:00    4.0
2011-02-12 00:05:10    5.0
2011-02-12 00:06:00    5.5
2011-02-13 00:08:00    6.5
2011-02-14 00:09:00    7.5
Name: 0, dtype: float64
Answer 7:
To keep it basic, I used a loop and something like this to get you started (my index values are datetimes):
import pandas as pd
import datetime as dt

# populate your dataframe: "df"
# ...

# a window anchored at the first index value; adjust the timedelta as needed
df[df.index < df.index[0] + dt.timedelta(days=3)]
and then you can run functions on that slice. You can see how adding an iterator to make the start of the window something other than the first value in your dataframe's index would then roll the window (you could also use a > rule for the start, for example), as in the sketch below.
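A minimal sketch of that rolling loop (the 3-day window and the result layout are my own assumptions):

    # Hypothetical sketch: slide a 3-day window across each timestamp,
    # using a > rule for the start and a <= rule for the end.
    window = dt.timedelta(days=3)
    means = {}
    for ts in df.index:
        sliced = df[(df.index > ts - window) & (df.index <= ts)]
        means[ts] = sliced.mean()
    result = pd.DataFrame(means).T  # one row of window means per timestamp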
Note that this may be less efficient for very large data or very small increments, since the slicing becomes more expensive (it works well enough for me on hundreds of thousands of rows with several columns, using hourly windows across a few weeks).