Python: look back n days rolling standard deviation

匿名 (未验证) 提交于 2019-12-03 01:34:02

问题:

I have a question about dealing with the rolling standard deviation:

The data frame looks like this:

2010-01-20 05:00:00   -0.011 2010-01-20 05:02:00   -0.032 2010-01-20 05:02:00   -0.037 2010-01-20 05:04:00    0.001 2010-01-20 05:06:00    0.023 2010-01-20 05:06:00    0.011 2010-01-20 05:08:00    0.049 2010-01-20 05:10:00    0.102 .... 2010-05-20 17:00:00    0.022 

This is 2-min data from 5am to 5pm (The format of index 'yyyy-mm-dd hh:mm:ss' is datestamp)

I want to calculate the 8-day look-back on the standard deviation. My intuition is to split the data frame into daily data set and then calculate the rolling standard deviation, but I don't know how to deal with these indexand i guess my methods may takes a lot of time to calculate. Thanks a lot for your help!

Finally, I would like the result like this:

2010-01-20   0.0 2010-01-21   0.0 2010-01-22   0.0 .... 2010-01-26   0.0 2010-01-27   0.12 2010-01-28   0.02 2010-01-29   0.07 ... 2010-05-20   0.10 

Thank you for your help. @unutbu

Just found the problem in the data: The data frame is not completely including the whole 2-min data. For example:

2010-01-21 15:08:00    0.044 2010-01-22 05:10:00    0.102 

The data ends at 15:08 on 2010-01-21 and start at 05:10:00 on 2010-01-22. so setting window size with a constant may not fixed this problem. Any suggestions? thanks a lot

回答1:

If the time series has a constant frequency:

You could compute the number of 2 second interals in 8 days:

window_size = pd.Timedelta('8D')/pd.Timedelta('2min') 

and then use pd.rolling_std with window=window_size:

import pandas as pd import numpy as np np.random.seed(1)  index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2T') N = len(index) df = pd.DataFrame({'val': np.random.random(N)}, index=index) # the number of 2 second intervals in 8 days window_size = pd.Timedelta('8D')/pd.Timedelta('2min')    # 5760.0  df['std'] = pd.rolling_std(df['val'], window=window_size) print(df.tail()) 

yields

                          val       std 2010-05-20 16:52:00  0.768918  0.291137 2010-05-20 16:54:00  0.486348  0.291098 2010-05-20 16:56:00  0.679610  0.291099 2010-05-20 16:58:00  0.951798  0.291114 2010-05-20 17:00:00  0.059935  0.291109 

To resample this time series so as to get one value per day, you could use the resample method and aggregate the values by taking the mean:

df['std'].resample('D', how='mean') 

yields

... 2010-05-16    0.289019 2010-05-17    0.289988 2010-05-18    0.289713 2010-05-19    0.289269 2010-05-20    0.288890 Freq: D, Name: std, Length: 121 

Above, we computed the rolling standard deviation and then resampled to a time series with daily frequency.

If we were to resample the original data to daily frequency first and then compute the rolling standard deviation then in general the result would be different.

Note also that your data looks like it has quite a bit of variation within each day, so resampling by taking the mean might (wrongly?) hide that variation. So it is probably better to compute the std first.


If the time series does not have a constant frequency:

If you have enough memory, I think the easiest way to deal with this situation is to use asfreq to expand the time series to one that has a constant frequency.

import pandas as pd import numpy as np np.random.seed(1)  # make an example df index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2T') N = len(index) df = pd.DataFrame({'val': np.random.random(N)}, index=index) mask = np.random.randint(2, size=N).astype(bool) df = df.loc[mask]  # expand the time series, filling in missing values with NaN df = df.asfreq('2T', method=None)  # now we can use the constant-frequency solution window_size = pd.Timedelta('8D')/pd.Timedelta('2min')     df['std'] = pd.rolling_std(df['val'], window=window_size, min_periods=1)  result = df['std'].resample('D', how='mean') print(result.head()) 

yields

2010-01-20    0.301834 2010-01-21    0.292505 2010-01-22    0.293897 2010-01-23    0.291018 2010-01-24    0.290444 Freq: D, Name: std, dtype: float64 

The alternative to expanding the time series is to write code to compute the correct sub-Series for each 8-day window. While this is possible, the fact that you would have to compute this for each row of the time series could make this method very slow. Thus, I think the faster approach is to expand the time series.



标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!