I have a question about dealing with the rolling standard deviation:
The data frame looks like this:
```
2010-01-20 05:00:00   -0.011
2010-01-20 05:02:00   -0.032
2010-01-20 05:02:00   -0.037
2010-01-20 05:04:00    0.001
2010-01-20 05:06:00    0.023
2010-01-20 05:06:00    0.011
2010-01-20 05:08:00    0.049
2010-01-20 05:10:00    0.102
....
2010-05-20 17:00:00    0.022
```
This is 2-min data from 5am to 5pm (the index has the format 'yyyy-mm-dd hh:mm:ss', i.e. a timestamp).

I want to calculate the standard deviation over an 8-day look-back window. My intuition is to split the data frame into daily data sets and then calculate the rolling standard deviation, but I don't know how to handle the index, and I suspect my method would take a long time to compute. Thanks a lot for your help!
Finally, I would like the result like this:
```
2010-01-20    0.0
2010-01-21    0.0
2010-01-22    0.0
....
2010-01-26    0.0
2010-01-27    0.12
2010-01-28    0.02
2010-01-29    0.07
...
2010-05-20    0.10
```
Thank you for your help. @unutbu
Just found a problem in the data: the data frame does not contain the complete 2-min data. For example:

```
2010-01-21 15:08:00    0.044
2010-01-22 05:10:00    0.102
```

The data ends at 15:08:00 on 2010-01-21 and starts again at 05:10:00 on 2010-01-22, so setting the window size to a constant may not handle this. Any suggestions? Thanks a lot.
If the time series has a constant frequency:
You could compute the number of 2-minute intervals in 8 days:

```
window_size = pd.Timedelta('8D')/pd.Timedelta('2min')
```

and then use `pd.rolling_std` with `window=window_size`:
```
import pandas as pd
import numpy as np

np.random.seed(1)
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2T')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)

# the number of 2-minute intervals in 8 days
window_size = pd.Timedelta('8D')/pd.Timedelta('2min')  # 5760.0

df['std'] = pd.rolling_std(df['val'], window=window_size)
print(df.tail())
```
yields
```
                          val       std
2010-05-20 16:52:00  0.768918  0.291137
2010-05-20 16:54:00  0.486348  0.291098
2010-05-20 16:56:00  0.679610  0.291099
2010-05-20 16:58:00  0.951798  0.291114
2010-05-20 17:00:00  0.059935  0.291109
```
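Note that `pd.rolling_std` was removed in later versions of pandas in favor of the `rolling` method. A sketch of the same computation against the current API (`rolling` expects an integer window when counting rows, hence the `int(...)`):

```python
import pandas as pd
import numpy as np

np.random.seed(1)
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2min')
df = pd.DataFrame({'val': np.random.random(len(index))}, index=index)

# the number of 2-minute intervals in 8 days; rolling() wants an integer
window_size = int(pd.Timedelta('8D') / pd.Timedelta('2min'))  # 5760

# trailing 8-day (5760-row) standard deviation
df['std'] = df['val'].rolling(window=window_size).std()
print(df['std'].tail())
```

By default `rolling` requires a full window, so the first `window_size - 1` rows of `df['std']` are NaN.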
To resample this time series so as to get one value per day, you could use the `resample` method and aggregate the values by taking the mean:

```
df['std'].resample('D', how='mean')
```
yields
```
...
2010-05-16    0.289019
2010-05-17    0.289988
2010-05-18    0.289713
2010-05-19    0.289269
2010-05-20    0.288890
Freq: D, Name: std, Length: 121
```
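The `how='mean'` keyword was likewise removed from `resample` in later pandas versions; the aggregation is now chained as a method call. A self-contained sketch of the full pipeline with the current API:

```python
import pandas as pd
import numpy as np

np.random.seed(1)
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2min')
s = pd.Series(np.random.random(len(index)), index=index)

window_size = int(pd.Timedelta('8D') / pd.Timedelta('2min'))
std = s.rolling(window=window_size).std()

# one value per day: the mean of that day's rolling-std values
daily = std.resample('D').mean()
print(daily.tail())
```

Since the rolling window needs 8 full days of data before producing a value, the first 8 daily entries come out as NaN rather than the zeros sketched in the question.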
Above, we computed the rolling standard deviation and then resampled to a time series with daily frequency.
If we were to resample the original data to daily frequency first and then compute the rolling standard deviation then in general the result would be different.
Note also that your data looks like it has quite a bit of variation within each day, so resampling by taking the mean might (wrongly?) hide that variation. So it is probably better to compute the std first.
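To see concretely that the order of operations matters, here is a small sketch (using hypothetical hourly data, not the asker's, and the current `rolling`/`resample` API) comparing the two pipelines on the same series:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
index = pd.date_range('2010-01-20', periods=30 * 24, freq='h')
s = pd.Series(np.random.random(len(index)), index=index)

# pipeline 1: rolling std over 8 days of hourly data, then daily means
a = s.rolling(window=8 * 24).std().resample('D').mean()

# pipeline 2: daily means first, then rolling std over 8 daily values
b = s.resample('D').mean().rolling(window=8).std()

# the two pipelines generally disagree
print((a - b).abs().max())
```

Pipeline 2 measures only the day-to-day variation of the daily means, so the within-day variation is averaged away before the std is ever taken.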
If the time series does not have a constant frequency:
If you have enough memory, I think the easiest way to deal with this situation is to use `asfreq` to expand the time series to one that has a constant frequency.
```
import pandas as pd
import numpy as np

np.random.seed(1)

# make an example df
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2T')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
mask = np.random.randint(2, size=N).astype(bool)
df = df.loc[mask]

# expand the time series, filling in missing values with NaN
df = df.asfreq('2T', method=None)

# now we can use the constant-frequency solution
window_size = pd.Timedelta('8D')/pd.Timedelta('2min')
df['std'] = pd.rolling_std(df['val'], window=window_size, min_periods=1)
result = df['std'].resample('D', how='mean')
print(result.head())
```
yields
```
2010-01-20    0.301834
2010-01-21    0.292505
2010-01-22    0.293897
2010-01-23    0.291018
2010-01-24    0.290444
Freq: D, Name: std, dtype: float64
```
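For completeness, a sketch of the same expansion approach written against the current pandas API (`asfreq`, `rolling`, and a chained `resample(...).mean()`):

```python
import pandas as pd
import numpy as np

np.random.seed(1)

# make an irregular example by dropping random rows
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2min')
N = len(index)
df = pd.DataFrame({'val': np.random.random(N)}, index=index)
mask = np.random.randint(2, size=N).astype(bool)
df = df.loc[mask]

# expand back to a constant 2-minute grid; missing rows become NaN
df = df.asfreq('2min')

# now the constant-frequency solution applies
window_size = int(pd.Timedelta('8D') / pd.Timedelta('2min'))
df['std'] = df['val'].rolling(window=window_size, min_periods=1).std()
result = df['std'].resample('D').mean()
print(result.head())
```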
The alternative to expanding the time series is to write code to compute the correct sub-Series for each 8-day window. While this is possible, the fact that you would have to compute this for each row of the time series could make this method very slow. Thus, I think the faster approach is to expand the time series.
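In recent pandas versions there is a third option that avoids expanding the series at all: passing an offset string such as `'8D'` as the `rolling` window makes pandas compute a time-based trailing window directly on the irregular `DatetimeIndex`. A sketch:

```python
import pandas as pd
import numpy as np

np.random.seed(1)
index = pd.date_range(start='2010-01-20 5:00', end='2010-05-20 17:00', freq='2min')
N = len(index)
s = pd.Series(np.random.random(N), index=index)
s = s[np.random.randint(2, size=N).astype(bool)]  # make the series irregular

# time-based window: each std covers the trailing 8 days of actual observations
std = s.rolling('8D').std()
daily = std.resample('D').mean()
print(daily.head())
```

With an offset window, `min_periods` defaults to 1, so only the very first observation (a window of one point) yields NaN; the gaps in the data are simply skipped rather than padded with NaN.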