Advanced Time Series Processing in Pandas (27)

只愿长相守, submitted on 2019-12-15 08:03:05

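Time series data in Pandas: Period

A pandas Period represents a span of time rather than a single instant. It carries attributes such as start_time (when the span begins) and end_time (when it ends). Its freq parameter works like the freq parameter of date_range covered earlier and accepts values such as 'S' and 'D'.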

import pandas as pd
p = pd.Period('2018-12-15', freq = "A")
print p.start_time, p.end_time, p + 1, p
print pd.Period('2013-1-9 11:22:33', freq='S') + 1
print pd.Period('2013-1-9 11:22:33', freq='T') + 1
print pd.Period('2013-1-9 11:22:33', freq='H') + 1
print pd.Period('2013-1-9 11:22:33', freq='D') + 1
print pd.Period('2013-1-9 11:22:33', freq='M') + 1
print pd.Period('2013-1-9 11:22:33', freq='A') + 1

Output:

2018-01-01 00:00:00 2018-12-31 23:59:59.999999999 2019 2018
2013-01-09 11:22:34 # S: second
2013-01-09 11:23 # T: minute
2013-01-09 12:00 # H: hour
2013-01-10 # D: day
2013-02 # M: month
2014 # A: year
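
The listing above is Python 2. As a minimal sketch of the same idea for Python 3, assuming a reasonably recent pandas, the snippet below builds a monthly Period, reads its bounds, steps to the next period, and checks whether a timestamp falls inside it. The variable names are illustrative only.

```python
import pandas as pd

# A monthly period covering all of December 2018.
p = pd.Period("2018-12-15", freq="M")

start = p.start_time   # first instant of the period
end = p.end_time       # last instant of the period
nxt = p + 1            # adding an integer moves by whole periods

# A timestamp belongs to the period if it lies between the two bounds.
ts = pd.Timestamp("2018-12-24 18:00")
inside = start <= ts <= end

print(start, end, nxt, inside)
```

Only 'M' is used here because the spelling of several other aliases ('S', 'T', 'H', 'A') differs between pandas releases.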

A Period has the following attributes (among others):

day Get day of the month that a Period falls on.
dayofweek Return the day of the week.
dayofyear Return the day of the year.
days_in_month Get the total number of days in the month that this period falls on.
daysinmonth Get the total number of days of the month that the Period falls in.
hour Get the hour of the day component of the Period.
minute Get minute of the hour component of the Period.
second Get the second component of the Period.
start_time Get the Timestamp for the start of the period.
week Get the week of the year on the given Period.

The following program exercises these attributes.

import pandas as pd
att = ["S", "T", "H", "D", "M", "A"]
for a in att:
    p = pd.Period('2018-12-19 11:22:33', freq= a)
    print "freq =", a
    print "Start from:", p.start_time, " End at:", p.end_time
    print "Day",p.day, "Dayofweek", p.dayofweek,"dayofyear", p.dayofyear,"daysinmonth", p.daysinmonth
    print "hour", p.hour, "minute", p.minute, "second", p.second, "\n"

Output:

freq = S
Start from: 2018-12-19 11:22:33  End at: 2018-12-19 11:22:33.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 11 minute 22 second 33 

freq = T
Start from: 2018-12-19 11:22:00  End at: 2018-12-19 11:22:59.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 11 minute 22 second 0 

freq = H
Start from: 2018-12-19 11:00:00  End at: 2018-12-19 11:59:59.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 11 minute 0 second 0 

freq = D
Start from: 2018-12-19 00:00:00  End at: 2018-12-19 23:59:59.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 0 minute 0 second 0 

freq = M
Start from: 2018-12-01 00:00:00  End at: 2018-12-31 23:59:59.999999999
Day 31 Dayofweek 0 dayofyear 365 daysinmonth 31
hour 0 minute 0 second 0 

freq = A
Start from: 2018-01-01 00:00:00  End at: 2018-12-31 23:59:59.999999999
Day 31 Dayofweek 0 dayofyear 365 daysinmonth 31
hour 0 minute 0 second 0 
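
In the output above, the 'M' and 'A' rows report Day 31 and dayofyear 365 although the input date was the 19th. A plausible reading is that, for a period coarser than one day, the calendar attributes describe the last day of the period. The sketch below (Python 3, illustrative names) checks that reading for the monthly case by comparing the Period attributes with those of its end_time.

```python
import pandas as pd

p = pd.Period("2018-12-19 11:22:33", freq="M")
end = p.end_time  # 2018-12-31 23:59:59.999999999

# Calendar fields read from the Period itself ...
fields = (p.day, p.dayofweek, p.dayofyear, p.daysinmonth)
# ... and the same fields read from the period's last instant.
from_end = (end.day, end.dayofweek, end.dayofyear, end.daysinmonth)

print(fields, from_end)
```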

 

Time series data in Pandas: period_range

The pandas period_range function generates a sequence of periods that can serve as the index of a Series.

import pandas as pd
import numpy as np
att = ["S", "T", "H", "D", "M", "A"]
vi = np.random.randn(5)
for a in att:
    pi = pd.period_range('2018-12-19 11:22:33', periods = 5, freq= a)
    ts = pd.Series(vi, index = pi)
    print ts, "\n"

Output:

2018-12-19 11:22:33   -0.275161
2018-12-19 11:22:34   -0.763390
2018-12-19 11:22:35   -2.012351
2018-12-19 11:22:36   -1.126492
2018-12-19 11:22:37    0.843842
Freq: S, dtype: float64 

2018-12-19 11:22   -0.275161
2018-12-19 11:23   -0.763390
2018-12-19 11:24   -2.012351
2018-12-19 11:25   -1.126492
2018-12-19 11:26    0.843842
Freq: T, dtype: float64 

2018-12-19 11:00   -0.275161
2018-12-19 12:00   -0.763390
2018-12-19 13:00   -2.012351
2018-12-19 14:00   -1.126492
2018-12-19 15:00    0.843842
Freq: H, dtype: float64 

2018-12-19   -0.275161
2018-12-20   -0.763390
2018-12-21   -2.012351
2018-12-22   -1.126492
2018-12-23    0.843842
Freq: D, dtype: float64 

2018-12   -0.275161
2019-01   -0.763390
2019-02   -2.012351
2019-03   -1.126492
2019-04    0.843842
Freq: M, dtype: float64 

2018   -0.275161
2019   -0.763390
2020   -2.012351
2021   -1.126492
2022    0.843842
Freq: A-DEC, dtype: float64 
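
A short Python 3 sketch of using such a period-indexed Series, assuming a recent pandas: it builds the monthly case from the output above, looks a value up by its period label, and reads the bounds of the first period. The names pi, ts, feb, and first_start are illustrative.

```python
import numpy as np
import pandas as pd

# Five consecutive monthly periods starting with December 2018.
pi = pd.period_range("2018-12-19", periods=5, freq="M")
ts = pd.Series(np.arange(5), index=pi)

# Label-based lookup with an explicit Period key.
feb = ts.loc[pd.Period("2019-02", freq="M")]

# Each index entry is a Period, so it still knows its own bounds.
first_start = ts.index[0].start_time

print(ts)
print(feb, first_start)
```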

 

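Time series data in Pandas: time series operations

The earlier installments mostly built a Series by using timestamps as the index and supplying the values, with index and values lined up one to one. In real work with pandas, the values and the index are sometimes offset from each other, and the values need to move forward or backward as a whole; the pandas shift function does this. At other times the index is too dense or too sparse and the spacing of the time series needs to change; asfreq and resample adjust the interval of a time series.

Shifting and alignment

This addresses the first problem above, where the data and the time index are misaligned. Either the values can be moved, or the index (the timestamps) can be moved.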

import numpy as np
import pandas as pd
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index = pd.date_range('2018-12-19', periods = 5))
print t0
t1 = t0.shift(1)
print t1
t2 = t1.fillna(method = "bfill")
print t2

Output:

2018-12-19    5 # t0
2018-12-20    4
2018-12-21    3
2018-12-22    2
2018-12-23    1
Freq: D, dtype: int64
2018-12-19    NaN # t1
2018-12-20    5.0
2018-12-21    4.0
2018-12-22    3.0
2018-12-23    2.0
Freq: D, dtype: float64
2018-12-19    5.0 # t2
2018-12-20    5.0
2018-12-21    4.0
2018-12-22    3.0
2018-12-23    2.0
Freq: D, dtype: float64

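The output shows that the values of the Series moved down by one position while the index stayed the same. The gap this leaves in the shifted Series can be cleaned up with fillna.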
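
The same steps as a Python 3 sketch, assuming a recent pandas, where the bfill method replaces the fillna(method = "bfill") spelling. The last line adds one typical use of a shifted series, a day-over-day difference, which is an illustration and not part of the original example.

```python
import pandas as pd

t0 = pd.Series([5, 4, 3, 2, 1],
               index=pd.date_range("2018-12-19", periods=5))

t1 = t0.shift(1)   # values move down one row; the first row becomes NaN
t2 = t1.bfill()    # back-fill the gap from the next valid value
change = t0 - t1   # today's value minus yesterday's value

print(t1)
print(t2)
print(change)
```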

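shift also takes a freq parameter, which defaults to None. With freq left unset, only the values move and the index stays put, as shown above. Passing a freq moves the index by that amount instead and leaves the values attached to their shifted timestamps.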

  • freq = 'B': the index is shifted in units of business days.
import numpy as np
import pandas as pd
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index = pd.date_range('2018-12-19', periods = 5))
print t0
t1 = t0.shift(1, freq = "B")
print t1
  • freq based on "H", such as "2H": the index is shifted in units of hours.
import numpy as np
import pandas as pd
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:30:50', periods = 5))
print t0
t1 = t0.shift(1, freq = "2H")
print t1

Output of the "2H" example:

2018-12-19 10:30:50    5 # t0
2018-12-20 10:30:50    4
2018-12-21 10:30:50    3
2018-12-22 10:30:50    2
2018-12-23 10:30:50    1
Freq: D, dtype: int64
2018-12-19 12:30:50    5 # t1
2018-12-20 12:30:50    4
2018-12-21 12:30:50    3
2018-12-22 12:30:50    2
2018-12-23 12:30:50    1
Freq: D, dtype: int64

A DateOffset object can also be passed as the freq:

import numpy as np
import pandas as pd
from pandas import DateOffset
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:30:50', periods = 5))
print t0
t1 = t0.shift(1, DateOffset(hours = 0.5))
print t1

Output:

2018-12-19 10:30:50    5 # t0
2018-12-20 10:30:50    4
2018-12-21 10:30:50    3
2018-12-22 10:30:50    2
2018-12-23 10:30:50    1
Freq: D, dtype: int64
2018-12-19 11:00:50    5 # t1
2018-12-20 11:00:50    4
2018-12-21 11:00:50    3
2018-12-22 11:00:50    2
2018-12-23 11:00:50    1
Freq: D, dtype: int64

The whole index has moved half an hour later.
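
To put the two modes of shift side by side, here is a Python 3 sketch, assuming a recent pandas. It passes a pd.Timedelta as freq rather than a string alias, purely to avoid depending on how a given pandas release spells the hour alias; the names by_rows and by_time are illustrative.

```python
import pandas as pd

t0 = pd.Series([5, 4, 3, 2, 1],
               index=pd.date_range("2018-12-19 10:30:50", periods=5))

# Without freq: the index is untouched and the values slide down one row.
by_rows = t0.shift(1)

# With freq: the values are untouched and every timestamp moves by 2 hours.
by_time = t0.shift(1, freq=pd.Timedelta(hours=2))

print(by_rows)
print(by_time)
```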

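Adjusting the frequency

This addresses the second problem: the original time series is too dense or too sparse. asfreq changes the spacing of the series. The point to watch is whether the new timestamps have matching data; where they do not, the values can be filled in with a mean, interpolation, or a similar method.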

import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:00:00', periods = c, freq = "2H"))
print t0[:13]
t1 = t0.asfreq("D")
print t1[:13]

Output:

2018-12-19 10:00:00     0
2018-12-19 12:00:00     1
2018-12-19 14:00:00     2
2018-12-19 16:00:00     3
2018-12-19 18:00:00     4
2018-12-19 20:00:00     5
2018-12-19 22:00:00     6
2018-12-20 00:00:00     7
2018-12-20 02:00:00     8
2018-12-20 04:00:00     9
2018-12-20 06:00:00    10
2018-12-20 08:00:00    11
2018-12-20 10:00:00    12
Freq: 2H, dtype: int64
2018-12-19 10:00:00      0
2018-12-20 10:00:00     12
2018-12-21 10:00:00     24
2018-12-22 10:00:00     36
2018-12-23 10:00:00     48
2018-12-24 10:00:00     60
2018-12-25 10:00:00     72
2018-12-26 10:00:00     84
2018-12-27 10:00:00     96
2018-12-28 10:00:00    108
2018-12-29 10:00:00    120
2018-12-30 10:00:00    132
2018-12-31 10:00:00    144
Freq: D, dtype: int64

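asfreq has changed the spacing of the series from 2 hours to 1 day. Where a new timestamp also exists in the original series, the original value is carried over, for example 2018-12-20 10:00:00 with value 12. Where a new timestamp has no match in the original series, the new series holds NaN at that position.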
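
When going to a coarser spacing, asfreq only picks the values that sit exactly on the new timestamps and discards the rest. If an aggregate such as the mean is wanted instead, resample can be used. The Python 3 sketch below contrasts the two on a shortened version of the series (24 points instead of 31 * 24), assuming a recent pandas; the names picked and averaged are illustrative.

```python
import numpy as np
import pandas as pd

# 24 readings, one every 2 hours, starting 2018-12-19 10:00.
idx = pd.date_range("2018-12-19 10:00:00", periods=24,
                    freq=pd.Timedelta(hours=2))
t0 = pd.Series(np.arange(24), index=idx)

# asfreq keeps only the readings that fall exactly on the daily grid.
picked = t0.asfreq("D")

# resample groups readings by calendar day and averages each group.
averaged = t0.resample("D").mean()

print(picked)
print(averaged)
```

Note that the two results are labeled differently: asfreq keeps the 10:00 timestamps, while resample labels each group by the start of its calendar day.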

import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:00:00', periods = c, freq = "2H"))
print t0[:13]
t1 = t0.asfreq("H")
print t1[:13]

Output:

2018-12-19 10:00:00     0
2018-12-19 12:00:00     1
2018-12-19 14:00:00     2
2018-12-19 16:00:00     3
2018-12-19 18:00:00     4
2018-12-19 20:00:00     5
2018-12-19 22:00:00     6
2018-12-20 00:00:00     7
2018-12-20 02:00:00     8
2018-12-20 04:00:00     9
2018-12-20 06:00:00    10
2018-12-20 08:00:00    11
2018-12-20 10:00:00    12
Freq: 2H, dtype: int64
2018-12-19 10:00:00    0.0
2018-12-19 11:00:00    NaN
2018-12-19 12:00:00    1.0
2018-12-19 13:00:00    NaN
2018-12-19 14:00:00    2.0
2018-12-19 15:00:00    NaN
2018-12-19 16:00:00    3.0
2018-12-19 17:00:00    NaN
2018-12-19 18:00:00    4.0
2018-12-19 19:00:00    NaN
2018-12-19 20:00:00    5.0
2018-12-19 21:00:00    NaN
2018-12-19 22:00:00    6.0
Freq: H, dtype: float64

2018-12-19 11:00:00 has no corresponding value in the original Series t0; fillna can fill such gaps.

import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:00:00', periods = c, freq = "2H"))
print t0[:13]
t1 = t0.asfreq("H")
print t1[:13]
t2 = t1.fillna(method = "bfill")
print t2[:13]

Output:

2018-12-19 10:00:00     0
2018-12-19 12:00:00     1
2018-12-19 14:00:00     2
2018-12-19 16:00:00     3
2018-12-19 18:00:00     4
2018-12-19 20:00:00     5
2018-12-19 22:00:00     6
2018-12-20 00:00:00     7
2018-12-20 02:00:00     8
2018-12-20 04:00:00     9
2018-12-20 06:00:00    10
2018-12-20 08:00:00    11
2018-12-20 10:00:00    12
Freq: 2H, dtype: int64
2018-12-19 10:00:00    0.0
2018-12-19 11:00:00    NaN
2018-12-19 12:00:00    1.0
2018-12-19 13:00:00    NaN
2018-12-19 14:00:00    2.0
2018-12-19 15:00:00    NaN
2018-12-19 16:00:00    3.0
2018-12-19 17:00:00    NaN
2018-12-19 18:00:00    4.0
2018-12-19 19:00:00    NaN
2018-12-19 20:00:00    5.0
2018-12-19 21:00:00    NaN
2018-12-19 22:00:00    6.0
Freq: H, dtype: float64
2018-12-19 10:00:00    0.0
2018-12-19 11:00:00    1.0
2018-12-19 12:00:00    1.0
2018-12-19 13:00:00    2.0
2018-12-19 14:00:00    2.0
2018-12-19 15:00:00    3.0
2018-12-19 16:00:00    3.0
2018-12-19 17:00:00    4.0
2018-12-19 18:00:00    4.0
2018-12-19 19:00:00    5.0
2018-12-19 20:00:00    5.0
2018-12-19 21:00:00    6.0
2018-12-19 22:00:00    6.0
Freq: H, dtype: float64

Alternatively, pass method to asfreq directly, for example:

import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:00:00', periods = c, freq = "2H"))
print t0[:13]
t1 = t0.asfreq("H", method = "bfill")
print t1[:13]
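
A compact Python 3 version of the upsampling case, assuming a recent pandas. It uses a four-point series so the whole result fits on screen and shows the unfilled, back-filled, and forward-filled variants. pd.Timedelta is used for the frequencies only to stay independent of how the hour alias is spelled in a given pandas release; all names are illustrative.

```python
import numpy as np
import pandas as pd

two_hours = pd.Timedelta(hours=2)
one_hour = pd.Timedelta(hours=1)

t0 = pd.Series(np.arange(4),
               index=pd.date_range("2018-12-19 10:00:00", periods=4,
                                   freq=two_hours))

hourly = t0.asfreq(one_hour)                  # new timestamps get NaN
back = t0.asfreq(one_hour, method="bfill")    # take the next known value
fwd = t0.asfreq(one_hour, method="ffill")     # take the previous known value

print(hourly)
print(back)
print(fwd)
```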

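Time series data in Pandas: resampling with resample

In pandas, changing the frequency of a time series is called resampling: converting the series from one frequency to another, which the resample function does. There are two kinds: upsampling (low frequency to high frequency) and downsampling (high frequency to low frequency). The object that resample returns offers a groupby-like interface for retrieving results. Resampling a time series returns a Resampler object; Resampler is a class defined in the pandas.core.resample module, and dir lists its methods.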

liao@liao:~/md$ python
Python 2.7.12 (default, Nov 12 2018, 14:36:49) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas.core.resample as pcr
>>> dir(pcr.Resampler)
['__bytes__', ......, '_wrap_result', 'agg', 'aggregate', 'apply', 'asfreq', 'ax', 'backfill', 'bfill', 'count', 'ffill', 'fillna', 'first', 'get_group', 'groups', 'indices', 'interpolate', 'last', 'max', 'mean', 'median', 'min', 'ndim', 'nearest', 'ngroups', 'nunique', 'obj', 'ohlc', 'pad', 'pipe', 'plot', 'prod', 'sem', 'size', 'std', 'sum', 'transform', 'var']

The list includes mean, pad, ohlc, std, first, fillna, and other methods for processing the resampled data.
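
As a small Python 3 illustration of that groupby-like interface, assuming a recent pandas: resample itself only sets up the grouping, and the numbers appear once one of the listed methods is called. Here agg requests several of them at once. The series tx mirrors the one used later in this article; kind and summary are illustrative names.

```python
import numpy as np
import pandas as pd

tx = pd.Series(np.arange(1, 21),
               index=pd.date_range("2018-12-01", periods=20, freq="D"))

r = tx.resample("4D")       # a Resampler object; nothing is aggregated yet
kind = type(r).__name__

# Ask for several aggregates per 4-day group in one call.
summary = r.agg(["first", "last", "mean", "sum"])

print(kind)
print(summary)
```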

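Downsampling

Going from a high-frequency series to a low-frequency one coarsens the time granularity and aggregates the data. Suppose 100 time points are reduced to 10: every 10 original values form one group (bucket). There were 100 time points and 100 values; now there are 10 time points, so there should be 10 values. What should those 10 values be? Each can be the mean of its group, the group's first value, its last value, or some other aggregate, depending on what the next processing step needs. These are obtained by calling the corresponding method on the object that resample returns. Because resample has many parameters that are hard to grasp, first build a small time series, as follows: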

import numpy as np
import pandas as pd
c = 21
v = np.arange(1, c)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods = 20, freq = "d")
print "tx", "-" * 20, "\n", tx

Output:

tx -------------------- 
2018-12-01     1
2018-12-02     2
2018-12-03     3
2018-12-04     4
2018-12-05     5
2018-12-06     6
2018-12-07     7
2018-12-08     8
2018-12-09     9
2018-12-10    10
2018-12-11    11
2018-12-12    12
2018-12-13    13
2018-12-14    14
2018-12-15    15
2018-12-16    16
2018-12-17    17
2018-12-18    18
2018-12-19    19
2018-12-20    20
Freq: D, dtype: int64

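The output pairs each date with one value, for example 2018-12-01 with 1. Now downsample tx by cutting it into segments of 4 days each. Using interval notation, the groups can be formed like this:

  • [2018-12-01, 2018-12-05) is the first group, so 2018-12-01 falls inside it,
  • [2018-12-05, 2018-12-09) is the second group,
  • [2018-12-09, 2018-12-13) is the third group,
  • [2018-12-13, 2018-12-17) is the fourth group,
  • [2018-12-17, 2018-12-21) is the fifth group. The date 2018-12-21 is not in the data, but the interval can be padded out to it. These groups are closed on the left and open on the right.

The groups can equally be described with intervals that are open on the left and closed on the right: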

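  • (2018-11-27, 2018-12-01] is the first group, chosen so that the first timestamp, 2018-12-01, falls inside a left-open, right-closed interval,
  • (2018-12-01, 2018-12-05] is the second group,
  • (2018-12-05, 2018-12-09] is the third group,
  • (2018-12-09, 2018-12-13] is the fourth group,
  • (2018-12-13, 2018-12-17] is the fifth group,
  • (2018-12-17, 2018-12-21] is the sixth group.

The extra group appears because the first timestamp has to fall inside the first group.
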
import numpy as np
import pandas as pd

v = np.arange(1, 21)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods = 20, freq = "d")
print "tx", "-" * 20, "\n", tx
tf = tx.resample("4d").sum()
print "tf closed using default", "-" * 5, "\n",tf
tf = tx.resample("4d", closed = "left").sum()
print "tf closed = 'left'     ", "-" * 5, "\n",tf
tf = tx.resample("4d", closed = "right").sum()
print "tf closed = 'right'    ", "-" * 5, "\n",tf

Output:

tx -------------------- 
2018-12-01     1
2018-12-02     2
....<omitted>....
2018-12-19    19
2018-12-20    20
Freq: D, dtype: int64
tf closed using default ----- 
2018-12-01    10
2018-12-05    26
2018-12-09    42
2018-12-13    58
2018-12-17    74
dtype: int64
tf closed = 'left'      ----- 
2018-12-01    10
2018-12-05    26
2018-12-09    42
2018-12-13    58
2018-12-17    74
dtype: int64
tf closed = 'right'     ----- 
2018-11-27     1
2018-12-01    14
2018-12-05    30
2018-12-09    46
2018-12-13    62
2018-12-17    57
dtype: int64

Consider the statements

tf = tx.resample("4d").sum()
print "tf closed using default", "-" * 5, "\n",tf
tf = tx.resample("4d", closed = "left").sum()
print "tf closed = 'left'     ", "-" * 5, "\n",tf

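Their output shows that, for this 4-day frequency, the closed parameter of resample defaults to left, that is, closed on the left and open on the right. So the value at 2018-12-01 is 10 = 1 + 2 + 3 + 4, and the value at 2018-12-05 is 26 = 5 + 6 + 7 + 8.

With left-open, right-closed intervals, the first interval contains only the data for 2018-12-01, so its sum is 1. Oddly, the index of that first entry is 2018-11-27 rather than 2018-12-01, while the second entry is the one labeled 2018-12-01. Why? The answer lies in the second confusing parameter of resample, label. It decides whether each group is labeled in the output with the left edge or the right edge of its interval: for (a, b] or [a, b), is the label a or b? Here label is at its default of left, so the first group, (2018-11-27, 2018-12-01], is labeled 2018-11-27.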

import numpy as np
import pandas as pd

v = np.arange(1, 21)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods = 20, freq = "d")
print "tx", "-" * 20, "\n", tx
tf = tx.resample("4d").sum()
print "tf closed using default", "-" * 5, "\n",tf
tf = tx.resample("4d", closed = "left").sum()
print "tf closed = 'left'     ", "-" * 5, "\n",tf
tf = tx.resample("4d", closed = "right").sum()
print "tf closed = 'right'    ", "-" * 5, "\n",tf
tf = tx.resample("4d", closed = "right", label = "right").sum()
print "tf closed = 'right' label = 'right'", "-" * 0, "\n",tf

Output:

tx -------------------- 
2018-12-01     1
....<omitted>....
2018-12-20    20
Freq: D, dtype: int64
tf closed using default ----- 
2018-12-01    10
2018-12-05    26
2018-12-09    42
2018-12-13    58
2018-12-17    74
dtype: int64
tf closed = 'left'      ----- 
2018-12-01    10
2018-12-05    26
2018-12-09    42
2018-12-13    58
2018-12-17    74
dtype: int64
tf closed = 'right'     ----- 
2018-11-27     1
2018-12-01    14
2018-12-05    30
2018-12-09    46
2018-12-13    62
2018-12-17    57
dtype: int64
tf closed = 'right' label = 'right' 
2018-12-01     1
2018-12-05    14
2018-12-09    30
2018-12-13    46
2018-12-17    62
2018-12-21    57
dtype: int64

The statements

tf = tx.resample("4d", closed = "right", label = "right").sum()
print "tf closed = 'right' label = 'right'", "-" * 0, "\n",tf

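produce output in which the first index is now 2018-12-01 with a sum of 1, which is correct. The second entry, 2018-12-05, has the value 14 = 2 + 3 + 4 + 5, also correct, and there are 6 groups, which matches the earlier analysis.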
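
The interplay of closed and label can be checked in a few lines. This Python 3 sketch, assuming a recent pandas, reruns the three combinations discussed above and confirms one group sum by slicing the original series directly; bins is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

tx = pd.Series(np.arange(1, 21),
               index=pd.date_range("2018-12-01", periods=20, freq="D"))

def bins(closed, label):
    """Sum tx in 4-day groups with the given interval and label sides."""
    return tx.resample("4D", closed=closed, label=label).sum()

left_left = bins("left", "left")       # same as the defaults for "4D"
right_left = bins("right", "left")     # groups change, labels are left edges
right_right = bins("right", "right")   # same groups, labeled by right edges

# Manual check of the group (2018-12-01, 2018-12-05]: 2 + 3 + 4 + 5.
manual = tx["2018-12-02":"2018-12-05"].sum()

print(left_left, right_left, right_right, manual, sep="\n")
```

Changing label never changes the sums; it only changes which edge of each interval appears in the index.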

The ohlc function

Finance often cares about the opening, closing, highest, and lowest prices. After resampling, calling ohlc on the result returns this summary for each group.

import numpy as np
import pandas as pd

v = np.arange(1, 21)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods = 20, freq = "d")
print "tx", "-" * 20, "\n", tx
tf = tx.resample("4d", closed = "right", label = "right").ohlc()
print "tf closed = 'right' label = 'right'", "-" * 0, "\n",tf

Output:

tx -------------------- 
2018-12-01     1
2018-12-02     2
....<omitted>....
2018-12-19    19
2018-12-20    20
Freq: D, dtype: int64
tf closed = 'right' label = 'right'  
            open  high  low  close
2018-12-01     1     1    1      1
2018-12-05     2     5    2      5
2018-12-09     6     9    6      9
2018-12-13    10    13   10     13
2018-12-17    14    17   14     17
2018-12-21    18    20   18     20
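
One way to read the ohlc table is as four ordinary aggregates computed per group: first, max, min, and last. The Python 3 sketch below, assuming a recent pandas, builds both and lets them be compared column by column; by_hand is an illustrative name.

```python
import numpy as np
import pandas as pd

tx = pd.Series(np.arange(1, 21),
               index=pd.date_range("2018-12-01", periods=20, freq="D"))

r = tx.resample("4D", closed="right", label="right")

bars = r.ohlc()

# The same four columns assembled from individual aggregates.
by_hand = pd.DataFrame({
    "open": r.first(),   # first value in each group
    "high": r.max(),     # largest value in each group
    "low": r.min(),      # smallest value in each group
    "close": r.last(),   # last value in each group
})

print(bars)
print(by_hand)
```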

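Upsampling

Going from a low frequency to a higher one produces many NaN values. The fill methods of the resampled object, such as bfill, ffill, and interpolate, determine how those gaps are filled.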

import numpy as np
import pandas as pd
v = np.arange(1, 21)
#print v
t0 = pd.Series(v, index = pd.date_range('2018-12-01', periods = 20))
#print t0
print "first", "*" * 22
print t0.resample("6H").first()[:10]
print "bfill", "*" * 22
print t0.resample("6H").bfill()[:10]
print "ffill", "*" * 22
print t0.resample("6H").ffill()[:10]
print "interpolate", "*" * 16
print t0.resample("6H").interpolate()[:10]

Output:

first **********************
2018-12-01 00:00:00    1.0
2018-12-01 06:00:00    NaN
2018-12-01 12:00:00    NaN
2018-12-01 18:00:00    NaN
2018-12-02 00:00:00    2.0
2018-12-02 06:00:00    NaN
2018-12-02 12:00:00    NaN
2018-12-02 18:00:00    NaN
2018-12-03 00:00:00    3.0
2018-12-03 06:00:00    NaN
Freq: 6H, dtype: float64
bfill **********************
2018-12-01 00:00:00    1
2018-12-01 06:00:00    2
2018-12-01 12:00:00    2
2018-12-01 18:00:00    2
2018-12-02 00:00:00    2
2018-12-02 06:00:00    3
2018-12-02 12:00:00    3
2018-12-02 18:00:00    3
2018-12-03 00:00:00    3
2018-12-03 06:00:00    4
Freq: 6H, dtype: int32
ffill **********************
2018-12-01 00:00:00    1
2018-12-01 06:00:00    1
2018-12-01 12:00:00    1
2018-12-01 18:00:00    1
2018-12-02 00:00:00    2
2018-12-02 06:00:00    2
2018-12-02 12:00:00    2
2018-12-02 18:00:00    2
2018-12-03 00:00:00    3
2018-12-03 06:00:00    3
Freq: 6H, dtype: int32
interpolate ****************
2018-12-01 00:00:00    1.00
2018-12-01 06:00:00    1.25
2018-12-01 12:00:00    1.50
2018-12-01 18:00:00    1.75
2018-12-02 00:00:00    2.00
2018-12-02 06:00:00    2.25
2018-12-02 12:00:00    2.50
2018-12-02 18:00:00    2.75
2018-12-03 00:00:00    3.00
2018-12-03 06:00:00    3.25
Freq: 6H, dtype: float64
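
A Python 3 rendering of the same comparison, assuming a recent pandas. The 6-hour rule is given as a pd.Timedelta so the sketch does not depend on the spelling of the hour alias, and asfreq() on the resampled object is used to show the unfilled result. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

t0 = pd.Series(np.arange(1, 21),
               index=pd.date_range("2018-12-01", periods=20))

up = t0.resample(pd.Timedelta(hours=6))   # daily data on a 6-hour grid

raw = up.asfreq()             # new timestamps are NaN
filled_fwd = up.ffill()       # repeat the previous day's value
filled_back = up.bfill()      # take the next day's value
linear = up.interpolate()     # straight line between neighbouring days

print(raw[:5], filled_fwd[:5], filled_back[:5], linear[:5], sep="\n")
```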

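Time series in Pandas: rolling windows

What is a rolling (moving) window? To make a reading more robust, the value at a point is replaced by a judgment over an interval that contains the point; that interval is the window. Suppose a value for 1 January 2011 is needed. Taking the single data point for that day works, but it is too rigid. A better option is to take the data from 2010-12-16 to 2011-1-15 and use its mean to estimate the value at 1 January. The span 2010-12-16 to 2011-1-15 is the window, with a length of roughly 30 days (window = 30).

A rolling window slides toward one end, not a whole window at a time but one unit at a time. After the window 2010-12-16 to 2011-1-15, the next window is not 2011-1-15 to 2011-2-15 but 2010-12-17 to 2011-1-16 (assuming daily data): the whole window moves right by one unit, not by one window length. Every value computed this way is therefore a mean over the same number of units.

Values are produced only from the position where the window is fully covered; before that the result is NaN. For example, with a window size of 10, the first 9 positions do not have a full window behind them and get no value.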
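
The behaviour described above can be seen on a tiny series. This Python 3 sketch, assuming a recent pandas, uses a window of 3 on ten daily values: a trailing window (NaN until the window is full), a centered window (the point sits in the middle, as in the 1 January example), and a window that is allowed to start with fewer points through min_periods. All names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1.0, 11.0),
              index=pd.date_range("2018-12-01", periods=10, freq="D"))

# Trailing window: each value is the mean of the current and 2 previous days.
trailing = s.rolling(window=3).mean()

# Centered window: each value is the mean of the day and its two neighbours.
centered = s.rolling(window=3, center=True).mean()

# Accept windows with at least 1 value, so the first rows are not NaN.
partial = s.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"s": s, "trailing": trailing,
                    "centered": centered, "partial": partial}))
```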

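Commonly used rolling window functions in pandas are listed below. These module-level rolling_* functions belong to older pandas releases; they were deprecated in pandas 0.18 and later removed, and the same operations are now methods of the object returned by rolling(), which is what the example program further down uses.

Function    Description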
rolling_count(arg, window[, freq, center, how]) Rolling count of number of non-NaN observations inside provided window.
rolling_sum(arg, window[, min_periods, ...]) Moving sum.
rolling_mean(arg, window[, min_periods, ...]) Moving mean.
rolling_median(arg, window[, min_periods, ...]) O(N log(window)) implementation using skip list
rolling_var(arg, window[, min_periods, ...]) Numerically stable implementation using Welford’s method.
rolling_std(arg, window[, min_periods, ...]) Moving standard deviation.
rolling_min(arg, window[, min_periods, ...]) Moving min of 1d array of dtype=float64 along axis=0 ignoring NaNs.
rolling_max(arg, window[, min_periods, ...]) Moving max of 1d array of dtype=float64 along axis=0 ignoring NaNs.
rolling_corr(arg1[, arg2, window, ...]) Moving sample correlation.
rolling_corr_pairwise(df1[, df2, window, ...]) Deprecated.
rolling_cov(arg1[, arg2, window, ...]) Unbiased moving covariance.
rolling_skew(arg, window[, min_periods, ...]) Unbiased moving skewness.
rolling_kurt(arg, window[, min_periods, ...]) Unbiased moving kurtosis.
rolling_apply(arg, window, func[, ...]) Generic moving function application.
rolling_quantile(arg, window, quantile[, ...]) Moving quantile.
rolling_window(arg[, window, win_type, ...]) Applies a moving window of type window_type and size window on the data.
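
As a rough guide to how several of the functions in the table map onto the method-based interface, here is a Python 3 sketch assuming a recent pandas. The mapping shown (for example rolling_mean(arg, window) becoming arg.rolling(window).mean()) follows the naming pattern of the methods; the series s and other and the dictionary results are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1.0, 11.0))
other = pd.Series(np.arange(10.0, 0.0, -1.0))

r = s.rolling(window=3)

results = {
    "sum": r.sum(),                  # rolling_sum
    "mean": r.mean(),                # rolling_mean
    "median": r.median(),            # rolling_median
    "std": r.std(),                  # rolling_std
    "min": r.min(),                  # rolling_min
    "max": r.max(),                  # rolling_max
    "quantile": r.quantile(0.5),     # rolling_quantile
    "corr": r.corr(other),           # rolling_corr
    # rolling_apply: any function of the window; here its range.
    "apply": r.apply(lambda w: w.max() - w.min(), raw=True),
}

print(pd.DataFrame(results))
```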

The following program applies a rolling window, using the rolling mean as the example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
v = np.random.randn(20)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods = 20, freq = "d")
#print "tx", "-" * 20, "\n", tx
rm = tx.rolling(window = 5, center = False).mean()
rm.plot()
tx.plot()
plt.show()

Output:

In the plot, the green line is tx and the blue line is rm, the rolling mean of tx.
