Pandas Time Series Data - Period

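A pandas Period represents a period, that is, one specific span of time. It exposes attributes such as start_time and end_time. Its freq parameter works like the freq parameter of date_range described earlier and accepts values such as 'S' and 'D'. The examples here use the frequency aliases of the pandas version the article was written against; pandas 2.2 and later deprecate the uppercase 'S', 'T', 'H' and 'A' in favour of 's', 'min', 'h' and 'Y'.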

import pandas as pd
# An annual period that contains 2018-12-15
p = pd.Period('2018-12-15', freq="A")
print(p.start_time, p.end_time, p + 1, p)
# Adding 1 moves forward by one unit of the period's own frequency
print(pd.Period('2013-1-9 11:22:33', freq='S') + 1)
print(pd.Period('2013-1-9 11:22:33', freq='T') + 1)
print(pd.Period('2013-1-9 11:22:33', freq='H') + 1)
print(pd.Period('2013-1-9 11:22:33', freq='D') + 1)
print(pd.Period('2013-1-9 11:22:33', freq='M') + 1)
print(pd.Period('2013-1-9 11:22:33', freq='A') + 1)

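Program output:

2018-12-01 00:00:00 2018-12-31 23:59:59.999999999 2019-01 # apparently left over from a run with freq="M"; the code above does not print this line
2018-01-01 00:00:00 2018-12-31 23:59:59.999999999 2019 2018
2013-01-09 11:22:34 # S: second
2013-01-09 11:23 # T: minute
2013-01-09 12:00 # H: hour
2013-01-10 # D: day
2013-02 # M: month
2014 # A: year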
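As a small complement to the output above, here is a sketch of the same attributes on a monthly period. The variable names are only illustrative, and 'M' is used because that alias is spelled the same way across pandas versions.

```python
import pandas as pd

# A monthly period that contains 2018-12-15
p = pd.Period("2018-12-15", freq="M")

print(p)             # the period itself
print(p.start_time)  # first instant covered by the period
print(p.end_time)    # last instant covered by the period

# Adding an integer moves by whole periods, so the year rolls over here
next_p = p + 1
print(next_p)
```

The period is identified by its month alone; the day given in the input string only selects which month is meant.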

The attributes of the Period type include:

`day`	Get day of the month that a Period falls on.
`dayofweek`	Return the day of the week.
`dayofyear`	Return the day of the year.
`days_in_month`	Get the total number of days in the month that this period falls on.
`daysinmonth`	Get the total number of days of the month that the Period falls in.
`hour`	Get the hour of the day component of the Period.
`minute`	Get minute of the hour component of the Period.
`second`	Get the second component of the Period.
`start_time`	Get the Timestamp for the start of the period.
`week`	Get the week of the year on the given Period.

The following program exercises these attributes.

import pandas as pd
att = ["S", "T", "H", "D", "M", "A"]
for a in att:
    # The same timestamp viewed at six different frequencies
    p = pd.Period('2018-12-19 11:22:33', freq=a)
    print("freq =", a)
    print("Start from:", p.start_time, " End at:", p.end_time)
    print("Day", p.day, "Dayofweek", p.dayofweek, "dayofyear", p.dayofyear, "daysinmonth", p.daysinmonth)
    print("hour", p.hour, "minute", p.minute, "second", p.second, "\n")

Program output:

freq = S
Start from: 2018-12-19 11:22:33  End at: 2018-12-19 11:22:33.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 11 minute 22 second 33 

freq = T
Start from: 2018-12-19 11:22:00  End at: 2018-12-19 11:22:59.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 11 minute 22 second 0 

freq = H
Start from: 2018-12-19 11:00:00  End at: 2018-12-19 11:59:59.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 11 minute 0 second 0 

freq = D
Start from: 2018-12-19 00:00:00  End at: 2018-12-19 23:59:59.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 0 minute 0 second 0 

freq = M
Start from: 2018-12-01 00:00:00  End at: 2018-12-31 23:59:59.999999999
Day 31 Dayofweek 0 dayofyear 365 daysinmonth 31
hour 0 minute 0 second 0 

freq = A
Start from: 2018-01-01 00:00:00  End at: 2018-12-31 23:59:59.999999999
Day 31 Dayofweek 0 dayofyear 365 daysinmonth 31
hour 0 minute 0 second 0
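One detail in the output above is easy to miss: for the monthly and annual periods, day, dayofweek and dayofyear describe the last day of the period (2018-12-31), not the timestamp that was passed in. A minimal sketch comparing a daily and a monthly period, with illustrative variable names:

```python
import pandas as pd

stamp = "2018-12-19 11:22:33"
daily = pd.Period(stamp, freq="D")
monthly = pd.Period(stamp, freq="M")

# For a daily period the calendar fields describe 2018-12-19 itself
print(daily.day, daily.dayofweek, daily.dayofyear, daily.days_in_month)

# For a monthly period they describe the period's last day, 2018-12-31
print(monthly.day, monthly.dayofweek, monthly.dayofyear)

# Fields finer than the frequency are reported as zero
print(daily.hour, daily.minute, daily.second)
```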

Pandas Time Series Data - period_range

The pandas period_range function generates a sequence of periods that can serve as the index of a Series.

import pandas as pd
import numpy as np
att = ["S", "T", "H", "D", "M", "A"]
vi = np.random.randn(5)
for a in att:
    # Five consecutive periods at frequency a, all sharing the same values
    pi = pd.period_range('2018-12-19 11:22:33', periods=5, freq=a)
    ts = pd.Series(vi, index=pi)
    print(ts, "\n")

Program output:

2018-12-19 11:22:33   -0.275161
2018-12-19 11:22:34   -0.763390
2018-12-19 11:22:35   -2.012351
2018-12-19 11:22:36   -1.126492
2018-12-19 11:22:37    0.843842
Freq: S, dtype: float64 

2018-12-19 11:22   -0.275161
2018-12-19 11:23   -0.763390
2018-12-19 11:24   -2.012351
2018-12-19 11:25   -1.126492
2018-12-19 11:26    0.843842
Freq: T, dtype: float64 

2018-12-19 11:00   -0.275161
2018-12-19 12:00   -0.763390
2018-12-19 13:00   -2.012351
2018-12-19 14:00   -1.126492
2018-12-19 15:00    0.843842
Freq: H, dtype: float64 

2018-12-19   -0.275161
2018-12-20   -0.763390
2018-12-21   -2.012351
2018-12-22   -1.126492
2018-12-23    0.843842
Freq: D, dtype: float64 

2018-12   -0.275161
2019-01   -0.763390
2019-02   -2.012351
2019-03   -1.126492
2019-04    0.843842
Freq: M, dtype: float64 

2018   -0.275161
2019   -0.763390
2020   -2.012351
2021   -1.126492
2022    0.843842
Freq: A-DEC, dtype: float64
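To show what such an index is made of, the sketch below builds a monthly PeriodIndex and looks up one value by its Period label. The data values and variable names are made up for illustration.

```python
import pandas as pd

# Five consecutive monthly periods starting at 2018-12
pi = pd.period_range("2018-12", periods=5, freq="M")
ts = pd.Series([10, 20, 30, 40, 50], index=pi)
print(ts)

# Look up a single value by its Period label
feb = ts[pd.Period("2019-02", freq="M")]
print(feb)

# Every index entry is a Period, so it has its own start_time and end_time
labels = [str(p) for p in pi]
first_start = pi[0].start_time
```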

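Pandas Time Series Data - Time Series Processing

The earlier sections built Series whose index is time and whose values line up with that index one to one. In real data the values are often offset from the index and need to move earlier or later as a whole; the pandas shift function does this. At other times the index is too dense or too sparse and the interval of the time series needs to change; asfreq and resample handle that.

Shifting and alignment

This addresses the first problem above: values and timestamps that are out of alignment. Either the values can be moved, or the index (the timestamps) can be moved.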

import pandas as pd
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index=pd.date_range('2018-12-19', periods=5))
print(t0)
t1 = t0.shift(1)   # move the values down one row; the index stays put
print(t1)
t2 = t1.bfill()    # backfill the NaN; replaces the older fillna(method="bfill")
print(t2)

Program output:

2018-12-19    5 # t0
2018-12-20    4
2018-12-21    3
2018-12-22    2
2018-12-23    1
Freq: D, dtype: int64
2018-12-19    NaN # t1
2018-12-20    5.0
2018-12-21    4.0
2018-12-22    3.0
2018-12-23    2.0
Freq: D, dtype: float64
2018-12-19    5.0 # t2
2018-12-20    5.0
2018-12-21    4.0
2018-12-22    3.0
2018-12-23    2.0
Freq: D, dtype: float64

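The output shows that the values of the series moved down one row as a whole while the index did not change. The gap left by the shift can then be cleaned up with a simple fill, here a backfill.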
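The same idea works in both directions. The sketch below, using illustrative variable names, shifts the values one row later and one row earlier and fills the resulting gap from the neighbouring row.

```python
import pandas as pd

idx = pd.date_range("2018-12-19", periods=5)
t0 = pd.Series([5, 4, 3, 2, 1], index=idx)

down = t0.shift(1)    # values move one row later; the first row becomes NaN
up = t0.shift(-1)     # values move one row earlier; the last row becomes NaN

# Fill each gap from the adjacent row
down_filled = down.bfill()
up_filled = up.ffill()

print(pd.DataFrame({"t0": t0, "shift(1)": down, "shift(-1)": up}))
```

Because a NaN is introduced, the shifted series is converted from int64 to float64, which is also visible in the output above.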

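The freq parameter of shift defaults to None, in which case only the values move by the given number of rows. When freq is given, the values stay in place and the index is moved by periods times freq instead.

With freq = 'B', the index moves in units of business days.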

import pandas as pd
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index=pd.date_range('2018-12-19', periods=5))
print(t0)
t1 = t0.shift(1, freq="B")   # move every timestamp forward by one business day
print(t1)
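The article shows no output for this program, so here is a sketch of what to expect. 2018-12-19 is a Wednesday, so the five dates run from Wednesday to Sunday, and Friday, Saturday and Sunday all move to the following Monday. The helper list shifted_days is only there to make the result easy to read.

```python
import pandas as pd

idx = pd.date_range("2018-12-19", periods=5)   # Wednesday through Sunday
t0 = pd.Series([5, 4, 3, 2, 1], index=idx)

# Shift the index by one business day; the values are untouched
t1 = t0.shift(1, freq="B")
shifted_days = [d.strftime("%Y-%m-%d") for d in t1.index]
print(t1)
```

Note that the shifted index can contain duplicate timestamps, which may need handling before further alignment.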

freq = "H" moves the index in units of hours. The example below uses "2H", so one shift moves every timestamp by two hours.

import pandas as pd
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index=pd.date_range('2018-12-19 10:30:50', periods=5))
print(t0)
t1 = t0.shift(1, freq="2H")   # one step of two hours
print(t1)

Program output:

2018-12-19 10:30:50    5 # t0
2018-12-20 10:30:50    4
2018-12-21 10:30:50    3
2018-12-22 10:30:50    2
2018-12-23 10:30:50    1
Freq: D, dtype: int64
2018-12-19 12:30:50    5 # t1
2018-12-20 12:30:50    4
2018-12-21 12:30:50    3
2018-12-22 12:30:50    2
2018-12-23 12:30:50    1
Freq: D, dtype: int64

The freq argument of shift also accepts a DateOffset.

import pandas as pd
from pandas import DateOffset
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index=pd.date_range('2018-12-19 10:30:50', periods=5))
print(t0)
# Half an hour, written as 30 minutes instead of hours=0.5
t1 = t0.shift(1, freq=DateOffset(minutes=30))
print(t1)

Program output:

2018-12-19 10:30:50    5 # t0
2018-12-20 10:30:50    4
2018-12-21 10:30:50    3
2018-12-22 10:30:50    2
2018-12-23 10:30:50    1
Freq: D, dtype: int64
2018-12-19 11:00:50    5 # t1
2018-12-20 11:00:50    4
2018-12-21 11:00:50    3
2018-12-22 11:00:50    2
2018-12-23 11:00:50    1
Freq: D, dtype: int64

The whole index has moved half an hour later.
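The difference between shifting rows and shifting time can be put side by side. This is a minimal sketch with illustrative names; it passes pd.Timedelta objects as freq, which sidesteps the alias spelling differences between pandas versions.

```python
import pandas as pd

idx = pd.date_range("2018-12-19 10:30:50", periods=3)
t0 = pd.Series([5, 4, 3], index=idx)

# Without freq: the values move, the index stays
by_rows = t0.shift(1)

# With freq: the index moves, the values stay
by_time = t0.shift(1, freq=pd.Timedelta(minutes=30))

# Negative periods move the index earlier: 2 steps of 1 hour back
back = t0.shift(-2, freq=pd.Timedelta(hours=1))

print(by_rows, by_time, back, sep="\n")
```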

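Adjusting the time frequency

This addresses the second problem: the original time series is too dense or too sparse. asfreq changes the interval of the time series. Pay attention to whether the new timestamps have matching data; where they do not, the values for the adjusted series can be filled, for example with a mean or by interpolation.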

import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
# One value every 2 hours
t0 = pd.Series(v, index=pd.date_range('2018-12-19 10:00:00', periods=c, freq="2H"))
print(t0[:13])
t1 = t0.asfreq("D")   # keep one timestamp per day
print(t1[:13])

Program output:

2018-12-19 10:00:00     0
2018-12-19 12:00:00     1
2018-12-19 14:00:00     2
2018-12-19 16:00:00     3
2018-12-19 18:00:00     4
2018-12-19 20:00:00     5
2018-12-19 22:00:00     6
2018-12-20 00:00:00     7
2018-12-20 02:00:00     8
2018-12-20 04:00:00     9
2018-12-20 06:00:00    10
2018-12-20 08:00:00    11
2018-12-20 10:00:00    12
Freq: 2H, dtype: int64
2018-12-19 10:00:00      0
2018-12-20 10:00:00     12
2018-12-21 10:00:00     24
2018-12-22 10:00:00     36
2018-12-23 10:00:00     48
2018-12-24 10:00:00     60
2018-12-25 10:00:00     72
2018-12-26 10:00:00     84
2018-12-27 10:00:00     96
2018-12-28 10:00:00    108
2018-12-29 10:00:00    120
2018-12-30 10:00:00    132
2018-12-31 10:00:00    144
Freq: D, dtype: int64

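asfreq changed the interval of the series from 2 hours to one day. Where a new timestamp also exists in the original series, its original value is reused, for example 2018-12-20 10:00:00 with value 12. Where a new timestamp has no counterpart in the original series, the new series holds NaN.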
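It may help to see that asfreq only selects existing values and never aggregates, in contrast to a resample followed by mean. The sketch below uses a shorter series (36 readings, an assumption made to keep the output small) and illustrative variable names.

```python
import numpy as np
import pandas as pd

# 36 readings, one every 2 hours, starting 2018-12-19 10:00 (three days of data)
idx = pd.date_range("2018-12-19 10:00:00", periods=36, freq=pd.Timedelta(hours=2))
t0 = pd.Series(np.arange(36), index=idx)

# asfreq keeps only the readings that sit on the new daily grid (10:00 each day)
picked = t0.asfreq("D")

# resample + mean averages every reading that falls within each calendar day
averaged = t0.resample("D").mean()

print(picked)
print(averaged)
```

Which of the two is appropriate depends on whether the coarser series should be a sample of the original points or a summary of them.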

import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
t0 = pd.Series(v, index=pd.date_range('2018-12-19 10:00:00', periods=c, freq="2H"))
print(t0[:13])
t1 = t0.asfreq("H")   # hourly grid: every other slot has no source value
print(t1[:13])

Program output:

2018-12-19 10:00:00     0
2018-12-19 12:00:00     1
2018-12-19 14:00:00     2
2018-12-19 16:00:00     3
2018-12-19 18:00:00     4
2018-12-19 20:00:00     5
2018-12-19 22:00:00     6
2018-12-20 00:00:00     7
2018-12-20 02:00:00     8
2018-12-20 04:00:00     9
2018-12-20 06:00:00    10
2018-12-20 08:00:00    11
2018-12-20 10:00:00    12
Freq: 2H, dtype: int64
2018-12-19 10:00:00    0.0
2018-12-19 11:00:00    NaN
2018-12-19 12:00:00    1.0
2018-12-19 13:00:00    NaN
2018-12-19 14:00:00    2.0
2018-12-19 15:00:00    NaN
2018-12-19 16:00:00    3.0
2018-12-19 17:00:00    NaN
2018-12-19 18:00:00    4.0
2018-12-19 19:00:00    NaN
2018-12-19 20:00:00    5.0
2018-12-19 21:00:00    NaN
2018-12-19 22:00:00    6.0
Freq: H, dtype: float64

2018-12-19 11:00:00 has no value in the original Series t0. Such gaps can be filled with fillna or, as here, with bfill.

import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
t0 = pd.Series(v, index=pd.date_range('2018-12-19 10:00:00', periods=c, freq="2H"))
print(t0[:13])
t1 = t0.asfreq("H")
print(t1[:13])
t2 = t1.bfill()   # backfill; replaces the older fillna(method="bfill")
print(t2[:13])

Program output:

2018-12-19 10:00:00     0
2018-12-19 12:00:00     1
2018-12-19 14:00:00     2
2018-12-19 16:00:00     3
2018-12-19 18:00:00     4
2018-12-19 20:00:00     5
2018-12-19 22:00:00     6
2018-12-20 00:00:00     7
2018-12-20 02:00:00     8
2018-12-20 04:00:00     9
2018-12-20 06:00:00    10
2018-12-20 08:00:00    11
2018-12-20 10:00:00    12
Freq: 2H, dtype: int64
2018-12-19 10:00:00    0.0
2018-12-19 11:00:00    NaN
2018-12-19 12:00:00    1.0
2018-12-19 13:00:00    NaN
2018-12-19 14:00:00    2.0
2018-12-19 15:00:00    NaN
2018-12-19 16:00:00    3.0
2018-12-19 17:00:00    NaN
2018-12-19 18:00:00    4.0
2018-12-19 19:00:00    NaN
2018-12-19 20:00:00    5.0
2018-12-19 21:00:00    NaN
2018-12-19 22:00:00    6.0
Freq: H, dtype: float64
2018-12-19 10:00:00    0.0
2018-12-19 11:00:00    1.0
2018-12-19 12:00:00    1.0
2018-12-19 13:00:00    2.0
2018-12-19 14:00:00    2.0
2018-12-19 15:00:00    3.0
2018-12-19 16:00:00    3.0
2018-12-19 17:00:00    4.0
2018-12-19 18:00:00    4.0
2018-12-19 19:00:00    5.0
2018-12-19 20:00:00    5.0
2018-12-19 21:00:00    6.0
2018-12-19 22:00:00    6.0
Freq: H, dtype: float64

Alternatively, pass method to asfreq directly, for example:

import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
t0 = pd.Series(v, index=pd.date_range('2018-12-19 10:00:00', periods=c, freq="2H"))
print(t0[:13])
t1 = t0.asfreq("H", method="bfill")   # fill the new slots while converting
print(t1[:13])
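No output is shown for this last program. As a rough illustration on a four-point series (made-up data, illustrative names), the sketch below compares the fill options that asfreq offers: no fill, method="bfill", method="ffill", and a constant fill_value.

```python
import pandas as pd

idx = pd.date_range("2018-12-19 10:00:00", periods=4, freq=pd.Timedelta(hours=2))
t0 = pd.Series([0, 1, 2, 3], index=idx)
hourly = pd.Timedelta(hours=1)

raw = t0.asfreq(hourly)                     # new slots are NaN
back = t0.asfreq(hourly, method="bfill")    # take the next known value
fwd = t0.asfreq(hourly, method="ffill")     # take the previous known value
const = t0.asfreq(hourly, fill_value=-1)    # use a fixed placeholder

print(pd.DataFrame({"raw": raw, "bfill": back, "ffill": fwd, "const": const}))
```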

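Pandas Time Series Data - Resampling with resample

In pandas, changing the frequency of a time series is called resampling: converting the series from one frequency to another, which the resample function does. There are two kinds: upsampling (low frequency to high frequency) and downsampling (high frequency to low frequency). The object produced by resample offers a groupby-like interface for extracting data. Calling resample on a time series returns a Resampler object; Resampler is a class defined in the pandas.core.resample module, and dir lists its methods.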

liao@liao:~/md$ python
Python 2.7.12 (default, Nov 12 2018, 14:36:49) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas.core.resample as pcr
>>> dir(pcr.Resampler)
['__bytes__', ......, '_wrap_result', 'agg', 'aggregate', 'apply', 'asfreq', 'ax', 'backfill', 'bfill', 'count', 'ffill', 'fillna', 'first', 'get_group', 'groups', 'indices', 'interpolate', 'last', 'max', 'mean', 'median', 'min', 'ndim', 'nearest', 'ngroups', 'nunique', 'obj', 'ohlc', 'pad', 'pipe', 'plot', 'prod', 'sem', 'size', 'std', 'sum', 'transform', 'var']

The listing includes mean, pad, ohlc, std, first, fillna and other methods that process the resampled data.
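To make the groupby-like interface concrete, here is a small sketch on an eight-day series (made-up data, illustrative names). Calling resample alone computes nothing; the values appear once an aggregation method is applied.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2018-12-01", periods=8, freq="D")
tx = pd.Series(np.arange(1, 9), index=idx)

# The Resampler only describes the grouping into 4-day buckets
r = tx.resample("4D")
print(type(r).__name__)

# Several aggregations at once, one column per method name
summary = r.agg(["first", "last", "mean", "max"])
print(summary)
```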

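Downsampling

Going from a high frequency to a low frequency makes the time granularity coarser and aggregates the data. Suppose 100 time points become 10 low-frequency points: every 10 original values form one group (bucket). Where there were 100 timestamps and 100 values, there are now 10 timestamps that need 10 values. What should those 10 values be? Each could be the mean of its group, its first value, or its last value, depending on what the next processing step needs. These are obtained by calling the corresponding method on the object returned by resample. Because resample has many parameters that are hard to grasp, start with a concrete time series: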

import numpy as np
import pandas as pd
c = 21
v = np.arange(1, c)   # the values 1..20
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods=20, freq="D")
print("tx", "-" * 20, "\n", tx)

Program output:

tx -------------------- 
2018-12-01     1
2018-12-02     2
2018-12-03     3
2018-12-04     4
2018-12-05     5
2018-12-06     6
2018-12-07     7
2018-12-08     8
2018-12-09     9
2018-12-10    10
2018-12-11    11
2018-12-12    12
2018-12-13    13
2018-12-14    14
2018-12-15    15
2018-12-16    16
2018-12-17    17
2018-12-18    18
2018-12-19    19
2018-12-20    20
Freq: D, dtype: int64

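Each date in the output maps to one value; for example, 2018-12-01 holds 1. Now downsample tx, segmenting it into groups of 4 days. In interval notation the groups can be written as follows:

[2018-12-01, 2018-12-05) is the first group, so 2018-12-01 falls inside it,
[2018-12-05, 2018-12-09) is the second group,
[2018-12-09, 2018-12-13) is the third group,
[2018-12-13, 2018-12-17) is the fourth group,
[2018-12-17, 2018-12-21) is the fifth group. Although 2018-12-21 is not in the data, the interval can be padded out to it. These groups are closed on the left and open on the right.

The groups can also be described with intervals that are open on the left and closed on the right:

(2018-11-27, 2018-12-01] is the first group, chosen so that the first timestamp 2018-12-01 falls inside a left-open, right-closed group,
(2018-12-01, 2018-12-05] is the second group,
(2018-12-05, 2018-12-09] is the third group,
(2018-12-09, 2018-12-13] is the fourth group,
(2018-12-13, 2018-12-17] is the fifth group,
(2018-12-17, 2018-12-21] is the sixth group. The extra group exists because the first timestamp has to fall inside the first group.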

import numpy as np
import pandas as pd
v = np.arange(1, 21)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods=20, freq="D")
print("tx", "-" * 20, "\n", tx)
# Sum each 4-day bucket under three settings of closed
tf = tx.resample("4D").sum()
print("tf closed using default", "-" * 5, "\n", tf)
tf = tx.resample("4D", closed="left").sum()
print("tf closed = 'left'     ", "-" * 5, "\n", tf)
tf = tx.resample("4D", closed="right").sum()
print("tf closed = 'right'    ", "-" * 5, "\n", tf)

Program output:

tx -------------------- 
2018-12-01     1
2018-12-02     2
....<omitted>....
2018-12-19    19
2018-12-20    20
Freq: D, dtype: int64
tf closed using default ----- 
2018-12-01    10
2018-12-05    26
2018-12-09    42
2018-12-13    58
2018-12-17    74
dtype: int64
tf closed = 'left'      ----- 
2018-12-01    10
2018-12-05    26
2018-12-09    42
2018-12-13    58
2018-12-17    74
dtype: int64
tf closed = 'right'     ----- 
2018-11-27     1
2018-12-01    14
2018-12-05    30
2018-12-09    46
2018-12-13    62
2018-12-17    57
dtype: int64
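The default result can be checked by hand: with left-closed buckets starting at the first timestamp, each bucket is simply the next four values. A minimal sketch, where by_hand is an illustrative helper name:

```python
import numpy as np
import pandas as pd

v = np.arange(1, 21)
tx = pd.Series(v, index=pd.date_range("2018-12-01", periods=20, freq="D"))

# Let pandas build the 4-day buckets
resampled = tx.resample("4D").sum()

# Build the same sums by slicing the raw values four at a time
by_hand = [int(v[i:i + 4].sum()) for i in range(0, 20, 4)]

print(resampled)
print(by_hand)
```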

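From the output of the statements

tf = tx.resample("4D").sum()
print("tf closed using default", "-" * 5, "\n", tf)
tf = tx.resample("4D", closed="left").sum()
print("tf closed = 'left'     ", "-" * 5, "\n", tf)

it is clear that for this frequency the closed parameter of resample defaults to left, that is, left-closed and right-open. So the value at 2018-12-01 is 10 = 1 + 2 + 3 + 4 and the value at 2018-12-05 is 26 = 5 + 6 + 7 + 8. With left-open, right-closed intervals, the first interval holds only the data for 2018-12-01, so its sum is 1. Oddly, the index of that first row is 2018-11-27 rather than 2018-12-01, while the second row is labelled 2018-12-01. Why? This is where the second puzzling parameter of resample comes in: label decides whether each output row is labelled with the left edge or the right edge of its interval. For (a, b] or [a, b), is the label the left edge a or the right edge b?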

import numpy as np
import pandas as pd
v = np.arange(1, 21)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods=20, freq="D")
print("tx", "-" * 20, "\n", tx)
tf = tx.resample("4D").sum()
print("tf closed using default", "-" * 5, "\n", tf)
tf = tx.resample("4D", closed="left").sum()
print("tf closed = 'left'     ", "-" * 5, "\n", tf)
tf = tx.resample("4D", closed="right").sum()
print("tf closed = 'right'    ", "-" * 5, "\n", tf)
# Same right-closed buckets, but labelled with their right edge
tf = tx.resample("4D", closed="right", label="right").sum()
print("tf closed = 'right' label = 'right'", "-" * 0, "\n", tf)

Program output:

tx -------------------- 
2018-12-01     1
....<omitted>....
2018-12-20    20
Freq: D, dtype: int64
tf closed using default ----- 
2018-12-01    10
2018-12-05    26
2018-12-09    42
2018-12-13    58
2018-12-17    74
dtype: int64
tf closed = 'left'      ----- 
2018-12-01    10
2018-12-05    26
2018-12-09    42
2018-12-13    58
2018-12-17    74
dtype: int64
tf closed = 'right'     ----- 
2018-11-27     1
2018-12-01    14
2018-12-05    30
2018-12-09    46
2018-12-13    62
2018-12-17    57
dtype: int64
tf closed = 'right' label = 'right' 
2018-12-01     1
2018-12-05    14
2018-12-09    30
2018-12-13    46
2018-12-17    62
2018-12-21    57
dtype: int64

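From the output of the statements

tf = tx.resample("4D", closed="right", label="right").sum()
print("tf closed = 'right' label = 'right'", "-" * 0, "\n", tf)

the first row is now labelled 2018-12-01 with sum 1, which is correct. The second row, 2018-12-05, has value 14 = 2 + 3 + 4 + 5, which is also correct, and there are 6 groups, matching the earlier analysis.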
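The two parameters are independent: closed decides which values land in a bucket, and label only decides how the bucket is named. A short sketch with illustrative names, reusing the same 20-day series:

```python
import numpy as np
import pandas as pd

tx = pd.Series(np.arange(1, 21),
               index=pd.date_range("2018-12-01", periods=20, freq="D"))

# Same right-closed buckets, labelled by left edge (the default here) or right edge
right_left = tx.resample("4D", closed="right").sum()
right_right = tx.resample("4D", closed="right", label="right").sum()

# The sums are identical; only the labels differ, by one bucket width
label_gap = right_right.index - right_left.index
print(right_left)
print(right_right)
```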

The ohlc function

Finance often cares about the opening, closing, highest and lowest prices. After resampling, calling ohlc on the result gives this summary.

import numpy as np
import pandas as pd
v = np.arange(1, 21)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods=20, freq="D")
print("tx", "-" * 20, "\n", tx)
# open/high/low/close of each right-closed 4-day bucket
tf = tx.resample("4D", closed="right", label="right").ohlc()
print("tf closed = 'right' label = 'right'", "-" * 0, "\n", tf)

Program output:

tx -------------------- 
2018-12-01     1
2018-12-02     2
....<omitted>....
2018-12-19    19
2018-12-20    20
Freq: D, dtype: int64
tf closed = 'right' label = 'right'  
            open  high  low  close
2018-12-01     1     1    1      1
2018-12-05     2     5    2      5
2018-12-09     6     9    6      9
2018-12-13    10    13   10     13
2018-12-17    14    17   14     17
2018-12-21    18    20   18     20
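The four ohlc columns are the first, maximum, minimum and last value of each bucket. The sketch below checks that on the same series with the default left-closed buckets; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

tx = pd.Series(np.arange(1, 21),
               index=pd.date_range("2018-12-01", periods=20, freq="D"))

r = tx.resample("4D")
bars = r.ohlc()
print(bars)

# Compare each column with the matching single aggregation
same_open = bars["open"].tolist() == r.first().tolist()
same_high = bars["high"].tolist() == r.max().tolist()
same_low = bars["low"].tolist() == r.min().tolist()
same_close = bars["close"].tolist() == r.last().tolist()
```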

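Upsampling

Going from a low frequency to a high frequency produces many NaN values. A fill method such as bfill, ffill or interpolate specifies how to fill them.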

import numpy as np
import pandas as pd
v = np.arange(1, 21)
#print(v)
t0 = pd.Series(v, index=pd.date_range('2018-12-01', periods=20))
#print(t0)
# Daily data upsampled to one row every 6 hours, filled four ways
print("first", "*" * 22)
print(t0.resample("6H").first()[:10])
print("bfill", "*" * 22)
print(t0.resample("6H").bfill()[:10])
print("ffill", "*" * 22)
print(t0.resample("6H").ffill()[:10])
print("interpolate", "*" * 16)
print(t0.resample("6H").interpolate()[:10])

Program output:

first **********************
2018-12-01 00:00:00    1.0
2018-12-01 06:00:00    NaN
2018-12-01 12:00:00    NaN
2018-12-01 18:00:00    NaN
2018-12-02 00:00:00    2.0
2018-12-02 06:00:00    NaN
2018-12-02 12:00:00    NaN
2018-12-02 18:00:00    NaN
2018-12-03 00:00:00    3.0
2018-12-03 06:00:00    NaN
Freq: 6H, dtype: float64
bfill **********************
2018-12-01 00:00:00    1
2018-12-01 06:00:00    2
2018-12-01 12:00:00    2
2018-12-01 18:00:00    2
2018-12-02 00:00:00    2
2018-12-02 06:00:00    3
2018-12-02 12:00:00    3
2018-12-02 18:00:00    3
2018-12-03 00:00:00    3
2018-12-03 06:00:00    4
Freq: 6H, dtype: int32
ffill **********************
2018-12-01 00:00:00    1
2018-12-01 06:00:00    1
2018-12-01 12:00:00    1
2018-12-01 18:00:00    1
2018-12-02 00:00:00    2
2018-12-02 06:00:00    2
2018-12-02 12:00:00    2
2018-12-02 18:00:00    2
2018-12-03 00:00:00    3
2018-12-03 06:00:00    3
Freq: 6H, dtype: int32
interpolate ****************
2018-12-01 00:00:00    1.00
2018-12-01 06:00:00    1.25
2018-12-01 12:00:00    1.50
2018-12-01 18:00:00    1.75
2018-12-02 00:00:00    2.00
2018-12-02 06:00:00    2.25
2018-12-02 12:00:00    2.50
2018-12-02 18:00:00    2.75
2018-12-03 00:00:00    3.00
2018-12-03 06:00:00    3.25
Freq: 6H, dtype: float64
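A three-day series is enough to compare the fill strategies side by side. This is only a sketch with made-up data; it passes pd.Timedelta(hours=6) as the rule, which avoids the spelling differences of hour aliases between pandas versions.

```python
import numpy as np
import pandas as pd

t0 = pd.Series([1, 2, 3], index=pd.date_range("2018-12-01", periods=3, freq="D"))
up = t0.resample(pd.Timedelta(hours=6))

gaps = up.asfreq()         # new slots left as NaN
held = up.ffill()          # carry the last known value forward
linear = up.interpolate()  # straight line between known values

print(pd.DataFrame({"asfreq": gaps, "ffill": held, "interpolate": linear}))
```

Forward fill keeps the result free of values that were never observed, while interpolation assumes the quantity changes smoothly between observations.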

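Pandas Time Series - Rolling Windows

What is a rolling (moving) window? To make a value more reliable, the estimate for a single point is widened to an interval containing that point, and the interval is used for the judgment. That interval is the window. For example, to use the value for 1 January 2011, taking only that one data point is possible but too rigid. A better option is to take 16 December 2010 through 15 January 2011 and estimate the value for 1 January from the mean. The span 2010-12-16 to 2011-1-15 is then one window, with a length of roughly 30 days (window=30).

A moving window slides toward one end, not one whole window at a time but one unit at a time. After the window 2010-12-16 to 2011-1-15, the next window is not 2011-1-15 to 2011-2-15 but 2010-12-17 to 2011-1-16 (assuming daily data): the whole window moves right by one unit, not by one window length. Each statistic is therefore always the mean of 30 units. Values are produced only from the first position where the window is fully covered; before that the result is NaN. For example, with a window size of 10, the first 9 positions do not have a full window and yield no value.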
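The NaN behaviour described above is easy to see on a five-point series with a window of 3. The sketch below uses made-up data and illustrative names; min_periods and center are existing parameters of rolling that relax the full-window rule and centre the window on each point, respectively.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
              index=pd.date_range("2018-12-01", periods=5, freq="D"))

# Trailing window of 3: the first 2 positions lack a full window
full = s.rolling(window=3).mean()

# Accept partial windows with at least 1 observation
partial = s.rolling(window=3, min_periods=1).mean()

# Window centred on each point, as in the 1 January example
centered = s.rolling(window=3, center=True).mean()

print(pd.DataFrame({"full": full, "partial": partial, "centered": centered}))
```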

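Commonly used rolling-window functions in pandas are listed below. These module-level rolling_* functions come from older pandas versions; they were later deprecated and removed in favour of the equivalent methods on the object returned by .rolling(), such as .rolling(window).mean().

Function	Description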
rolling_count(arg, window[, freq, center, how])	Rolling count of number of non-NaN observations inside provided window.
rolling_sum(arg, window[, min_periods, ...])	Moving sum.
rolling_mean(arg, window[, min_periods, ...])	Moving mean.
rolling_median(arg, window[, min_periods, ...])	O(N log(window)) implementation using skip list
rolling_var(arg, window[, min_periods, ...])	Numerically stable implementation using Welford’s method.
rolling_std(arg, window[, min_periods, ...])	Moving standard deviation.
rolling_min(arg, window[, min_periods, ...])	Moving min of 1d array of dtype=float64 along axis=0 ignoring NaNs.
rolling_max(arg, window[, min_periods, ...])	Moving max of 1d array of dtype=float64 along axis=0 ignoring NaNs.
rolling_corr(arg1[, arg2, window, ...])	Moving sample correlation.
rolling_corr_pairwise(df1[, df2, window, ...])	Deprecated.
rolling_cov(arg1[, arg2, window, ...])	Unbiased moving covariance.
rolling_skew(arg, window[, min_periods, ...])	Unbiased moving skewness.
rolling_kurt(arg, window[, min_periods, ...])	Unbiased moving kurtosis.
rolling_apply(arg, window, func[, ...])	Generic moving function application.
rolling_quantile(arg, window, quantile[, ...])	Moving quantile.
rolling_window(arg[, window, win_type, ...])	Applies a moving window of type window_type and size window on the data.
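For readers on a current pandas version, the sketch below shows the method-based spelling of a few of the entries in the table. The data and variable names are made up, and the comments naming the old functions are given only as a rough correspondence.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])
r = s.rolling(window=3)

moving_sum = r.sum()         # roughly what rolling_sum(s, 3) computed
moving_max = r.max()         # roughly rolling_max(s, 3)
moving_median = r.median()   # roughly rolling_median(s, 3)

# Generic function per window (roughly rolling_apply): here the range max - min
moving_range = r.apply(lambda w: w.max() - w.min(), raw=True)

print(pd.DataFrame({"sum": moving_sum, "max": moving_max,
                    "median": moving_median, "range": moving_range}))
```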

The following program computes a rolling-window mean as an example of using a rolling window:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
v = np.random.randn(20)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods = 20, freq = "d")
#print "tx", "-" * 20, "\n", tx
rm = tx.rolling(window = 5, center = False).mean()
rm.plot()
tx.plot()
plt.show()

Program output:

In the plot, the green line is tx and the blue line is rm, the mean after rolling-window processing.

Source: CSDN

Author: †徐先森®

Link: https://blog.csdn.net/qq_36622490/article/details/103479541
