Pandas time series data - Period
A pandas Period represents a span of time, i.e. one concrete time interval. It carries attributes such as the span's start (start_time) and end (end_time), and its freq parameter works like the freq parameter of date_range seen earlier, accepting values such as 'S' and 'D'.
import pandas as pd
p = pd.Period('2018-12-15', freq = "A")
print(p.start_time, p.end_time, p + 1, p)
print(pd.Period('2013-1-9 11:22:33', freq='S') + 1)
print(pd.Period('2013-1-9 11:22:33', freq='T') + 1)
print(pd.Period('2013-1-9 11:22:33', freq='H') + 1)
print(pd.Period('2013-1-9 11:22:33', freq='D') + 1)
print(pd.Period('2013-1-9 11:22:33', freq='M') + 1)
print(pd.Period('2013-1-9 11:22:33', freq='A') + 1)
Program output:
2018-01-01 00:00:00 2018-12-31 23:59:59.999999999 2019 2018
2013-01-09 11:22:34 # S second
2013-01-09 11:23 # T minute
2013-01-09 12:00 # H hour
2013-01-10 # D day
2013-02 # M month
2014 # A year
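Beyond this arithmetic, a Period can be converted between frequencies with its asfreq method. A minimal sketch; the how argument picks which end of the span the converted period lands on:

```python
import pandas as pd

# Convert a monthly period to daily resolution; `how` chooses
# which end of the month the resulting daily period represents.
p = pd.Period('2018-12', freq='M')
print(p.asfreq('D', how='start'))  # 2018-12-01
print(p.asfreq('D', how='end'))    # 2018-12-31
```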
A Period has, among others, the following attributes:

attribute | description |
---|---|
day | Get day of the month that a Period falls on. |
dayofweek | Return the day of the week. |
dayofyear | Return the day of the year. |
days_in_month | Get the total number of days in the month that this period falls on. |
daysinmonth | Get the total number of days of the month that the Period falls in. |
hour | Get the hour of the day component of the Period. |
minute | Get minute of the hour component of the Period. |
second | Get the second component of the Period. |
start_time | Get the Timestamp for the start of the period. |
week | Get the week of the year on the given Period. |
The following program exercises these attributes.
import pandas as pd
att = ["S", "T", "H", "D", "M", "A"]
for a in att:
    p = pd.Period('2018-12-19 11:22:33', freq = a)
    print("freq =", a)
    print("Start from:", p.start_time, " End at:", p.end_time)
    print("Day", p.day, "Dayofweek", p.dayofweek, "dayofyear", p.dayofyear, "daysinmonth", p.daysinmonth)
    print("hour", p.hour, "minute", p.minute, "second", p.second, "\n")
Program output:
freq = S
Start from: 2018-12-19 11:22:33 End at: 2018-12-19 11:22:33.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 11 minute 22 second 33
freq = T
Start from: 2018-12-19 11:22:00 End at: 2018-12-19 11:22:59.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 11 minute 22 second 0
freq = H
Start from: 2018-12-19 11:00:00 End at: 2018-12-19 11:59:59.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 11 minute 0 second 0
freq = D
Start from: 2018-12-19 00:00:00 End at: 2018-12-19 23:59:59.999999999
Day 19 Dayofweek 2 dayofyear 353 daysinmonth 31
hour 0 minute 0 second 0
freq = M
Start from: 2018-12-01 00:00:00 End at: 2018-12-31 23:59:59.999999999
Day 31 Dayofweek 0 dayofyear 365 daysinmonth 31
hour 0 minute 0 second 0
freq = A
Start from: 2018-01-01 00:00:00 End at: 2018-12-31 23:59:59.999999999
Day 31 Dayofweek 0 dayofyear 365 daysinmonth 31
hour 0 minute 0 second 0
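Two Periods with the same freq can also be subtracted. In recent pandas versions the result is an offset object whose n attribute gives the number of spans between them; a small sketch:

```python
import pandas as pd

d1 = pd.Period('2018-12-15', freq='D')
d2 = pd.Period('2018-12-20', freq='D')
delta = d2 - d1    # offset between the two daily periods
print(delta.n)     # number of days separating them
```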
Pandas time series data - period_range
pandas' period_range function generates a sequence of periods that can serve as the index of a Series.
import pandas as pd
import numpy as np
att = ["S", "T", "H", "D", "M", "A"]
vi = np.random.randn(5)
for a in att:
    pi = pd.period_range('2018-12-19 11:22:33', periods = 5, freq = a)
    ts = pd.Series(vi, index = pi)
    print(ts, "\n")
Program output:
2018-12-19 11:22:33 -0.275161
2018-12-19 11:22:34 -0.763390
2018-12-19 11:22:35 -2.012351
2018-12-19 11:22:36 -1.126492
2018-12-19 11:22:37 0.843842
Freq: S, dtype: float64
2018-12-19 11:22 -0.275161
2018-12-19 11:23 -0.763390
2018-12-19 11:24 -2.012351
2018-12-19 11:25 -1.126492
2018-12-19 11:26 0.843842
Freq: T, dtype: float64
2018-12-19 11:00 -0.275161
2018-12-19 12:00 -0.763390
2018-12-19 13:00 -2.012351
2018-12-19 14:00 -1.126492
2018-12-19 15:00 0.843842
Freq: H, dtype: float64
2018-12-19 -0.275161
2018-12-20 -0.763390
2018-12-21 -2.012351
2018-12-22 -1.126492
2018-12-23 0.843842
Freq: D, dtype: float64
2018-12 -0.275161
2019-01 -0.763390
2019-02 -2.012351
2019-03 -1.126492
2019-04 0.843842
Freq: M, dtype: float64
2018 -0.275161
2019 -0.763390
2020 -2.012351
2021 -1.126492
2022 0.843842
Freq: A-DEC, dtype: float64
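A Series indexed by a PeriodIndex can be converted to a regular timestamp index with to_timestamp, and back again with to_period. A brief sketch:

```python
import pandas as pd

pi = pd.period_range('2018-12', periods=3, freq='M')
ts = pd.Series([1, 2, 3], index=pi)
dt = ts.to_timestamp()       # PeriodIndex -> DatetimeIndex (start of each period)
back = dt.to_period('M')     # DatetimeIndex -> PeriodIndex again
print(dt.index)
print(back.index)
```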
Pandas time series data - time series processing
So far we have built Series data with timestamps as the index, aligned one-to-one with the values. In practice you will often find that the values and the index are offset from each other and the values need to be moved forward or shifted back as a whole; pandas' shift function moves the values for you. At other times the index is too dense and you want to widen the interval of the time series; asfreq and resample adjust the spacing of a time series.
Shifting for alignment
This addresses the first problem above, i.e. values and timestamps being misaligned: you can either move the values or move the index (the timestamps).
import numpy as np
import pandas as pd
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index = pd.date_range('2018-12-19', periods = 5))
print(t0)
t1 = t0.shift(1)
print(t1)
t2 = t1.fillna(method = "bfill")
print(t2)
Program output:
2018-12-19 5 # t0
2018-12-20 4
2018-12-21 3
2018-12-22 2
2018-12-23 1
Freq: D, dtype: int64
2018-12-19 NaN # t1
2018-12-20 5.0
2018-12-21 4.0
2018-12-22 3.0
2018-12-23 2.0
Freq: D, dtype: float64
2018-12-19 5.0 # t2
2018-12-20 5.0
2018-12-21 4.0
2018-12-22 3.0
2018-12-23 2.0
Freq: D, dtype: float64
As the output shows, the values of the Series all moved down one position while the index stayed unchanged. The shifted Series can then be cleaned up with a simple fillna.
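A common use of this value shift is computing period-over-period differences; a minimal sketch:

```python
import pandas as pd

t = pd.Series([5, 4, 3, 2, 1], index=pd.date_range('2018-12-19', periods=5))
diff = t - t.shift(1)   # change from the previous day; the first row is NaN
print(diff)
```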
By default shift has freq=None and moves only the values, leaving the index untouched. If a freq is passed instead, the values stay put and the index itself is moved by that frequency.
- freq = 'B': shift the index in units of business days.
import numpy as np
import pandas as pd
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index = pd.date_range('2018-12-19', periods = 5))
print(t0)
t1 = t0.shift(1, freq = "B")
print(t1)
- freq = "2H": shift the index timestamps in units of hours (here, two hours at a time).
import numpy as np
import pandas as pd
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:30:50', periods = 5))
print(t0)
t1 = t0.shift(1, freq = "2H")
print(t1)
Program output:
2018-12-19 10:30:50 5 # t0
2018-12-20 10:30:50 4
2018-12-21 10:30:50 3
2018-12-22 10:30:50 2
2018-12-23 10:30:50 1
Freq: D, dtype: int64
2018-12-19 12:30:50 5 # t1
2018-12-20 12:30:50 4
2018-12-21 12:30:50 3
2018-12-22 12:30:50 2
2018-12-23 12:30:50 1
Freq: D, dtype: int64
- Adjust the index by passing a DateOffset to shift.
import numpy as np
import pandas as pd
from pandas import DateOffset
v = [5, 4, 3, 2, 1]
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:30:50', periods = 5))
print(t0)
t1 = t0.shift(1, freq = DateOffset(minutes = 30))  # 30 minutes; fractional hours = 0.5 is rejected by newer pandas
print(t1)
Program output:
2018-12-19 10:30:50 5 # t0
2018-12-20 10:30:50 4
2018-12-21 10:30:50 3
2018-12-22 10:30:50 2
2018-12-23 10:30:50 1
Freq: D, dtype: int64
2018-12-19 11:00:50 5 # t1
2018-12-20 11:00:50 4
2018-12-21 11:00:50 3
2018-12-22 11:00:50 2
2018-12-23 11:00:50 1
Freq: D, dtype: int64
The whole time index has been pushed back by half an hour.
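shift's freq argument also accepts a pd.Timedelta, another way to move the index by a fixed amount; a sketch equivalent to the DateOffset above:

```python
import pandas as pd

t = pd.Series([5, 4], index=pd.date_range('2018-12-19 10:30:50', periods=2))
t1 = t.shift(1, freq=pd.Timedelta(minutes=30))  # index moves, values stay in place
print(t1)
```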
Adjusting the time frequency
This addresses the second problem: the original time index is too dense or too sparse. asfreq changes the interval of the time series. Be careful about whether the data still lines up after the change; means or interpolation can be used to fill the values at the new timestamps.
import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:00:00', periods = c, freq = "2H"))
print(t0[:13])
t1 = t0.asfreq("D")
print(t1[:13])
Program output:
2018-12-19 10:00:00 0
2018-12-19 12:00:00 1
2018-12-19 14:00:00 2
2018-12-19 16:00:00 3
2018-12-19 18:00:00 4
2018-12-19 20:00:00 5
2018-12-19 22:00:00 6
2018-12-20 00:00:00 7
2018-12-20 02:00:00 8
2018-12-20 04:00:00 9
2018-12-20 06:00:00 10
2018-12-20 08:00:00 11
2018-12-20 10:00:00 12
Freq: 2H, dtype: int64
2018-12-19 10:00:00 0
2018-12-20 10:00:00 12
2018-12-21 10:00:00 24
2018-12-22 10:00:00 36
2018-12-23 10:00:00 48
2018-12-24 10:00:00 60
2018-12-25 10:00:00 72
2018-12-26 10:00:00 84
2018-12-27 10:00:00 96
2018-12-28 10:00:00 108
2018-12-29 10:00:00 120
2018-12-30 10:00:00 132
2018-12-31 10:00:00 144
Freq: D, dtype: int64
With asfreq, the series went from a 2-hour interval to a daily interval. Where a timestamp in the new index also exists in the original series, the original value is reused as the new value, e.g. 2018-12-20 10:00:00 → 12. Where a new timestamp has no counterpart in the original, the value becomes NaN.
import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:00:00', periods = c, freq = "2H"))
print(t0[:13])
t1 = t0.asfreq("H")
print(t1[:13])
Program output:
2018-12-19 10:00:00 0
2018-12-19 12:00:00 1
2018-12-19 14:00:00 2
2018-12-19 16:00:00 3
2018-12-19 18:00:00 4
2018-12-19 20:00:00 5
2018-12-19 22:00:00 6
2018-12-20 00:00:00 7
2018-12-20 02:00:00 8
2018-12-20 04:00:00 9
2018-12-20 06:00:00 10
2018-12-20 08:00:00 11
2018-12-20 10:00:00 12
Freq: 2H, dtype: int64
2018-12-19 10:00:00 0.0
2018-12-19 11:00:00 NaN
2018-12-19 12:00:00 1.0
2018-12-19 13:00:00 NaN
2018-12-19 14:00:00 2.0
2018-12-19 15:00:00 NaN
2018-12-19 16:00:00 3.0
2018-12-19 17:00:00 NaN
2018-12-19 18:00:00 4.0
2018-12-19 19:00:00 NaN
2018-12-19 20:00:00 5.0
2018-12-19 21:00:00 NaN
2018-12-19 22:00:00 6.0
Freq: H, dtype: float64
2018-12-19 11:00:00 has no corresponding value in the original Series t0; fillna can fill it in.
import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:00:00', periods = c, freq = "2H"))
print(t0[:13])
t1 = t0.asfreq("H")
print(t1[:13])
t2 = t1.fillna(method = "bfill")
print(t2[:13])
Program output:
2018-12-19 10:00:00 0
2018-12-19 12:00:00 1
2018-12-19 14:00:00 2
2018-12-19 16:00:00 3
2018-12-19 18:00:00 4
2018-12-19 20:00:00 5
2018-12-19 22:00:00 6
2018-12-20 00:00:00 7
2018-12-20 02:00:00 8
2018-12-20 04:00:00 9
2018-12-20 06:00:00 10
2018-12-20 08:00:00 11
2018-12-20 10:00:00 12
Freq: 2H, dtype: int64
2018-12-19 10:00:00 0.0
2018-12-19 11:00:00 NaN
2018-12-19 12:00:00 1.0
2018-12-19 13:00:00 NaN
2018-12-19 14:00:00 2.0
2018-12-19 15:00:00 NaN
2018-12-19 16:00:00 3.0
2018-12-19 17:00:00 NaN
2018-12-19 18:00:00 4.0
2018-12-19 19:00:00 NaN
2018-12-19 20:00:00 5.0
2018-12-19 21:00:00 NaN
2018-12-19 22:00:00 6.0
Freq: H, dtype: float64
2018-12-19 10:00:00 0.0
2018-12-19 11:00:00 1.0
2018-12-19 12:00:00 1.0
2018-12-19 13:00:00 2.0
2018-12-19 14:00:00 2.0
2018-12-19 15:00:00 3.0
2018-12-19 16:00:00 3.0
2018-12-19 17:00:00 4.0
2018-12-19 18:00:00 4.0
2018-12-19 19:00:00 5.0
2018-12-19 20:00:00 5.0
2018-12-19 21:00:00 6.0
2018-12-19 22:00:00 6.0
Freq: H, dtype: float64
Alternatively, pass method directly to asfreq, for example:
import numpy as np
import pandas as pd
c = 31 * 24
v = np.arange(c)
t0 = pd.Series(v, index = pd.date_range('2018-12-19 10:00:00', periods = c, freq = "2H"))
print(t0[:13])
t1 = t0.asfreq("H", method = "bfill")
print(t1[:13])
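Instead of copying a neighboring value, the gaps can also be filled by linear interpolation on the upsampled series; a small sketch:

```python
import pandas as pd

t0 = pd.Series([0, 2, 4], index=pd.date_range('2018-12-19 10:00', periods=3, freq='2H'))
t1 = t0.asfreq('H').interpolate()  # NaNs between points become linear estimates
print(t1)
```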
Pandas time series data - resample
In pandas, adjusting the frequency of a time series is called resampling: the operation of converting from one frequency to another, done with the resample function. There are two directions: upsampling (low frequency to high) and downsampling (high frequency to low). A resampled series exposes a groupby-like interface for retrieving the aggregated data. Resampling a time series returns a Resampler object; Resampler is a class defined in the pandas.core.resample module, and dir shows some of its interface functions.
liao@liao:~/md$ python
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas.core.resample as pcr
>>> dir(pcr.Resampler)
['__bytes__', ......, '_wrap_result', 'agg', 'aggregate', 'apply', 'asfreq', 'ax', 'backfill', 'bfill', 'count', 'ffill', 'fillna', 'first', 'get_group', 'groups', 'indices', 'interpolate', 'last', 'max', 'mean', 'median', 'min', 'ndim', 'nearest', 'ngroups', 'nunique', 'obj', 'ohlc', 'pad', 'pipe', 'plot', 'prod', 'sem', 'size', 'std', 'sum', 'transform', 'var']
As you can see, interface functions such as mean, pad, ohlc, std, first and fillna are available to process the resampled data.
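Like a groupby object, the Resampler also supports agg with several statistics at once; a quick sketch:

```python
import pandas as pd
import numpy as np

t = pd.Series(np.arange(1, 21), index=pd.date_range('2018-12-01', periods=20))
r = t.resample('4D').agg(['sum', 'mean', 'max'])  # one column per statistic
print(r)
```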
Downsampling
Downsampling turns a high-frequency series into a low-frequency one: the time granularity grows and the data are aggregated. Suppose the original series has 100 time points and is downsampled to 10: every 10 consecutive data points form a group (bucket), and 100 points with 100 values become 10 points, which should carry 10 values. What should those 10 values be? They can be the mean of each bucket, or its first value, or its last, used as the resampled data for further processing; that is what the interface functions on the resampled object provide. Since resample has many parameters and is easy to misread, let's first build a sample time series:
import numpy as np
import pandas as pd
c = 21
v = np.arange(1, c)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods = 20, freq = "d")
print("tx", "-" * 20, "\n", tx)
Program output:
tx --------------------
2018-12-01 1
2018-12-02 2
2018-12-03 3
2018-12-04 4
2018-12-05 5
2018-12-06 6
2018-12-07 7
2018-12-08 8
2018-12-09 9
2018-12-10 10
2018-12-11 11
2018-12-12 12
2018-12-13 13
2018-12-14 14
2018-12-15 15
2018-12-16 16
2018-12-17 17
2018-12-18 18
2018-12-19 19
2018-12-20 20
Freq: D, dtype: int64
In the output, 2018-12-01 carries the value 1. Now let's downsample the series tx, grouping every 4 days into one segment. Using interval notation, the groups can be formed as follows:
- [2018-12-01, 2018-12-05) is the first group, so that 2018-12-01 falls in this interval,
- [2018-12-05, 2018-12-09) is the second group,
- [2018-12-09, 2018-12-13) is the third group,
- [2018-12-13, 2018-12-17) is the fourth group,
- [2018-12-17, 2018-12-21) is the fifth group; 2018-12-21 is not in the data but can be added to complete the bin. These groups are left-closed, right-open.
Of course, the groups can also be described with left-open, right-closed intervals:
- (2018-11-27, 2018-12-01] is the first group, chosen so that the first timestamp 2018-12-01 lands in a left-open, right-closed group,
- (2018-12-01, 2018-12-05] is the second group,
- (2018-12-05, 2018-12-09] is the third group,
- (2018-12-09, 2018-12-13] is the fourth group,
- (2018-12-13, 2018-12-17] is the fifth group,
- (2018-12-17, 2018-12-21] is the sixth group. The extra group here exists because the first time point must land in the first group.
import numpy as np
import pandas as pd
v = np.arange(1, 21)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods = 20, freq = "d")
print("tx", "-" * 20, "\n", tx)
tf = tx.resample("4d").sum()
print("tf closed using default", "-" * 5, "\n", tf)
tf = tx.resample("4d", closed = "left").sum()
print("tf closed = 'left' ", "-" * 5, "\n", tf)
tf = tx.resample("4d", closed = "right").sum()
print("tf closed = 'right' ", "-" * 5, "\n", tf)
Program output:
tx --------------------
2018-12-01 1
2018-12-02 2
....<omitted>....
2018-12-19 19
2018-12-20 20
Freq: D, dtype: int64
tf closed using default -----
2018-12-01 10
2018-12-05 26
2018-12-09 42
2018-12-13 58
2018-12-17 74
dtype: int64
tf closed = 'left' -----
2018-12-01 10
2018-12-05 26
2018-12-09 42
2018-12-13 58
2018-12-17 74
dtype: int64
tf closed = 'right' -----
2018-11-27 1
2018-12-01 14
2018-12-05 30
2018-12-09 46
2018-12-13 62
2018-12-17 57
dtype: int64
From the statements

tf = tx.resample("4d").sum()
print("tf closed using default", "-" * 5, "\n", tf)
tf = tx.resample("4d", closed = "left").sum()
print("tf closed = 'left' ", "-" * 5, "\n", tf)

we can see that resample's closed parameter defaults to left, i.e. the bins are left-closed, right-open. Hence the output value at 2018-12-01 is 10 = 1 + 2 + 3 + 4, and the value at 2018-12-05 is 26 = 5 + 6 + 7 + 8. With left-open, right-closed bins, however, the first interval contains only 2018-12-01, so its sum is 1. Curiously, the index of the first output row is not 2018-12-01 but 2018-11-27, while the second row is labeled 2018-12-01. Why? This is resample's second confusing parameter, label, which decides whether each output row is labeled with the left or the right boundary of its bin: for (a, b] or [a, b), is the index a or b?
import numpy as np
import pandas as pd
v = np.arange(1, 21)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods = 20, freq = "d")
print("tx", "-" * 20, "\n", tx)
tf = tx.resample("4d").sum()
print("tf closed using default", "-" * 5, "\n", tf)
tf = tx.resample("4d", closed = "left").sum()
print("tf closed = 'left' ", "-" * 5, "\n", tf)
tf = tx.resample("4d", closed = "right").sum()
print("tf closed = 'right' ", "-" * 5, "\n", tf)
tf = tx.resample("4d", closed = "right", label = "right").sum()
print("tf closed = 'right' label = 'right'", "-" * 0, "\n", tf)
Program output:
tx --------------------
2018-12-01 1
....<omitted>....
2018-12-20 20
Freq: D, dtype: int64
tf closed using default -----
2018-12-01 10
2018-12-05 26
2018-12-09 42
2018-12-13 58
2018-12-17 74
dtype: int64
tf closed = 'left' -----
2018-12-01 10
2018-12-05 26
2018-12-09 42
2018-12-13 58
2018-12-17 74
dtype: int64
tf closed = 'right' -----
2018-11-27 1
2018-12-01 14
2018-12-05 30
2018-12-09 46
2018-12-13 62
2018-12-17 57
dtype: int64
tf closed = 'right' label = 'right'
2018-12-01 1
2018-12-05 14
2018-12-09 30
2018-12-13 46
2018-12-17 62
2018-12-21 57
dtype: int64
From the statement

tf = tx.resample("4d", closed = "right", label = "right").sum()
print("tf closed = 'right' label = 'right'", "-" * 0, "\n", tf)

we see that the first output row is now labeled 2018-12-01 with sum 1, as expected; the second row, 2018-12-05, has the value 14 = 2 + 3 + 4 + 5, which is also correct; and there are six groups, matching the analysis above.
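One way to double-check the bin boundaries is to count how many source points fall into each bin; a small sketch using the same series:

```python
import pandas as pd
import numpy as np

t = pd.Series(np.arange(1, 21), index=pd.date_range('2018-12-01', periods=20))
# Right-closed bins: (11-27, 12-01], (12-01, 12-05], ... -> 1, 4, 4, 4, 4, 3 points
c4 = t.resample('4D', closed='right', label='right').count()
print(c4)
```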
The ohlc function
In finance, the open, close, high and low prices are of frequent interest. After resampling, pandas data can call the ohlc function to get exactly this summary.
import numpy as np
import pandas as pd
v = np.arange(1, 21)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods = 20, freq = "d")
print("tx", "-" * 20, "\n", tx)
tf = tx.resample("4d", closed = "right", label = "right").ohlc()
print("tf closed = 'right' label = 'right'", "-" * 0, "\n", tf)
Program output:
tx --------------------
2018-12-01 1
2018-12-02 2
....<omitted>....
2018-12-19 19
2018-12-20 20
Freq: D, dtype: int64
tf closed = 'right' label = 'right'
open high low close
2018-12-01 1 1 1 1
2018-12-05 2 5 2 5
2018-12-09 6 9 6 9
2018-12-13 10 13 10 13
2018-12-17 14 17 14 17
2018-12-21 18 20 18 20
Upsampling
Going from low frequency to high frequency produces many NaN values; a fill method can be chosen to supply the missing data.
import numpy as np
import pandas as pd
v = np.arange(1, 21)
#print(v)
t0 = pd.Series(v, index = pd.date_range('2018-12-01', periods = 20))
#print(t0)
print("first", "*" * 22)
print(t0.resample("6H").first()[:10])
print("bfill", "*" * 22)
print(t0.resample("6H").bfill()[:10])
print("ffill", "*" * 22)
print(t0.resample("6H").ffill()[:10])
print("interpolate", "*" * 16)
print(t0.resample("6H").interpolate()[:10])
Program output:
first **********************
2018-12-01 00:00:00 1.0
2018-12-01 06:00:00 NaN
2018-12-01 12:00:00 NaN
2018-12-01 18:00:00 NaN
2018-12-02 00:00:00 2.0
2018-12-02 06:00:00 NaN
2018-12-02 12:00:00 NaN
2018-12-02 18:00:00 NaN
2018-12-03 00:00:00 3.0
2018-12-03 06:00:00 NaN
Freq: 6H, dtype: float64
bfill **********************
2018-12-01 00:00:00 1
2018-12-01 06:00:00 2
2018-12-01 12:00:00 2
2018-12-01 18:00:00 2
2018-12-02 00:00:00 2
2018-12-02 06:00:00 3
2018-12-02 12:00:00 3
2018-12-02 18:00:00 3
2018-12-03 00:00:00 3
2018-12-03 06:00:00 4
Freq: 6H, dtype: int32
ffill **********************
2018-12-01 00:00:00 1
2018-12-01 06:00:00 1
2018-12-01 12:00:00 1
2018-12-01 18:00:00 1
2018-12-02 00:00:00 2
2018-12-02 06:00:00 2
2018-12-02 12:00:00 2
2018-12-02 18:00:00 2
2018-12-03 00:00:00 3
2018-12-03 06:00:00 3
Freq: 6H, dtype: int32
interpolate ****************
2018-12-01 00:00:00 1.00
2018-12-01 06:00:00 1.25
2018-12-01 12:00:00 1.50
2018-12-01 18:00:00 1.75
2018-12-02 00:00:00 2.00
2018-12-02 06:00:00 2.25
2018-12-02 12:00:00 2.50
2018-12-02 18:00:00 2.75
2018-12-03 00:00:00 3.00
2018-12-03 06:00:00 3.25
Freq: 6H, dtype: float64
Pandas time series - rolling windows
What is a rolling (moving) window? To make a statistic more robust, the value at a single point is replaced by a statistic over an interval containing that point; that interval is the window. For example, suppose you want a value for 2011-01-01. Taking that single point is possible, but too absolute. A better approach is to take 2010-12-16 through 2011-01-15 and use the mean over that span to estimate the value at January 1; 2010-12-16 to 2011-01-15 is a window of length window = 30.

A moving window slides toward one end one unit at a time, not one whole window at a time. After the window 2010-12-16 to 2011-01-15, the next window is not 2011-01-15 to 2011-02-15 but 2010-12-17 to 2011-01-16 (assuming daily data): the window moves right by one unit, not by one window. Each statistic is therefore always a mean over 30 units.

Values are produced only once a position is covered by a full window; before that the result is NaN. For example, with a window of size 10, the first 9 positions do not span a full window and yield no value.
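Those leading NaNs can be relaxed with rolling's min_periods parameter, which lets a partially filled window already produce a value; a sketch:

```python
import pandas as pd

t = pd.Series(range(1, 11))
m = t.rolling(window=5, min_periods=1).mean()  # emits values from the first point on
print(m)
```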
pandas historically exposed the following top-level rolling-window functions (modern versions replace them with the Series.rolling(...) method):
Function | Purpose |
---|---|
rolling_count(arg, window[, freq, center, how]) | Rolling count of number of non-NaN observations inside provided window. |
rolling_sum(arg, window[, min_periods, ...]) | Moving sum. |
rolling_mean(arg, window[, min_periods, ...]) | Moving mean. |
rolling_median(arg, window[, min_periods, ...]) | O(N log(window)) implementation using skip list |
rolling_var(arg, window[, min_periods, ...]) | Numerically stable implementation using Welford’s method. |
rolling_std(arg, window[, min_periods, ...]) | Moving standard deviation. |
rolling_min(arg, window[, min_periods, ...]) | Moving min of 1d array of dtype=float64 along axis=0 ignoring NaNs. |
rolling_max(arg, window[, min_periods, ...]) | Moving max of 1d array of dtype=float64 along axis=0 ignoring NaNs. |
rolling_corr(arg1[, arg2, window, ...]) | Moving sample correlation. |
rolling_corr_pairwise(df1[, df2, window, ...]) | Deprecated. |
rolling_cov(arg1[, arg2, window, ...]) | Unbiased moving covariance. |
rolling_skew(arg, window[, min_periods, ...]) | Unbiased moving skewness. |
rolling_kurt(arg, window[, min_periods, ...]) | Unbiased moving kurtosis. |
rolling_apply(arg, window, func[, ...]) | Generic moving function application. |
rolling_quantile(arg, window, quantile[, ...]) | Moving quantile. |
rolling_window(arg[, window, win_type, ...]) | Applies a moving window of type window_type and size window on the data. |
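The table lists the old top-level API (e.g. pd.rolling_sum), which newer pandas versions have removed; the method-based equivalent is Series.rolling(...). A sketch of the mapping:

```python
import pandas as pd
import numpy as np

s = pd.Series(np.arange(10, dtype=float))
# old API (removed): pd.rolling_sum(s, window=3)
# new method-based API:
r = s.rolling(window=3).sum()
print(r)
```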
The example below applies a rolling window to compute a moving mean:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
v = np.random.randn(20)
tx = pd.Series(v)
tx.index = pd.date_range('2018-12-01', periods = 20, freq = "d")
#print("tx", "-" * 20, "\n", tx)
rm = tx.rolling(window = 5, center = False).mean()
rm.plot()
tx.plot()
plt.show()
Program output:
In the plot, the green line is tx and the blue line is rm, the visualized rolling-window mean.
Source: CSDN
Author: †徐先森®
Link: https://blog.csdn.net/qq_36622490/article/details/103479541