Clustering similar time series?

ⅰ亾dé卋堺 submitted on 2020-12-30 08:14:23

Question


I have somewhere between 10k and 20k different time series (24-dimensional data, with one column for each hour of the day), and I'm interested in clustering the ones that exhibit roughly the same patterns of activity.

I had originally started to implement Dynamic Time Warping (DTW) because:

  1. Not all of my time series are perfectly aligned
  2. Two slightly shifted time series for my purposes should be considered similar
  3. Two time series with the same shape but different scales should be considered similar

The only problem I had run into with DTW was that it did not appear to scale well -- fastdtw on a 500x500 distance matrix took ~30 minutes.

What other methods exist that would help me satisfy conditions 2 & 3?


Answer 1:


ARIMA can do the job if you decompose the time series into trend, seasonality and residuals, and then run a k-nearest-neighbors search on the result. The computational cost may be high, though, mainly because of ARIMA.

In ARIMA:

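import numpy as np
# Import path in newer statsmodels; older releases used statsmodels.tsa.arima_model
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose

# X is a single 1-D time series
model0 = ARIMA(X, order=(2, 1, 0))
model1 = model0.fit()

# Set 'period' to the seasonality of your data (older statsmodels called this 'freq')
decomposition = seasonal_decompose(np.asarray(X).ravel(), period=100)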

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

Complementing @Sushant's comment: once you have decomposed the time series, you can check for similarity in any or all of the four plots: data, seasonality, trend and residuals.
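To make the per-component comparison concrete, here is a minimal NumPy-only sketch. It does not use statsmodels: `naive_decompose` and `pairwise_distances` are hypothetical helpers, and the decomposition is a deliberately crude stand-in (linear trend plus mean cycle) for `seasonal_decompose`. The synthetic series and the period of 24 are assumptions chosen for illustration.

```python
import numpy as np

def naive_decompose(series, period):
    """Crude additive decomposition: linear trend + mean cycle + residual.

    Hypothetical helper, a simplified stand-in for seasonal_decompose.
    """
    series = np.asarray(series, dtype=float)
    t = np.arange(len(series))
    # Trend: least-squares straight line through the series.
    slope, intercept = np.polyfit(t, series, 1)
    trend = slope * t + intercept
    detrended = series - trend
    # Seasonal: mean of the detrended values at each position in the cycle.
    cycle = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(cycle, len(series) // period + 1)[: len(series)]
    residual = detrended - seasonal
    return trend, seasonal, residual

def pairwise_distances(rows):
    """Euclidean distance between every pair of rows."""
    rows = np.asarray(rows, dtype=float)
    diff = rows[:, None, :] - rows[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Synthetic data (assumed for illustration): four days of hourly values.
t = np.arange(96)
a = np.sin(2 * np.pi * t / 24) + 0.05 * t
b = np.sin(2 * np.pi * t / 24) + 0.20 * t  # same cycle as a, steeper trend
c = np.sin(2 * np.pi * t / 12) + 0.05 * t  # same trend as a, different cycle

parts = [naive_decompose(s, period=24) for s in (a, b, c)]
trend_dist = pairwise_distances([p[0] for p in parts])
seasonal_dist = pairwise_distances([p[1] for p in parts])

print(np.round(trend_dist, 2))
print(np.round(seasonal_dist, 2))
```

Under these assumptions, `a` and `c` should come out close in the trend matrix, while `a` and `b` should come out close in the seasonal matrix, so which component you compare decides which series count as similar.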

An example with synthetic data:

import numpy as np
import matplotlib.pyplot as plt

# 28 evenly spaced points from 0 to 90 (endpoint included)
x = np.linspace(0, 30*3, 14*2)
sin1 = np.sin(x) + x/7
sin2 = np.sin(0.8*x) + x/5
sin3 = np.sin(1.3*x) + x/5
plt.plot(sin1,label='sin1')
plt.plot(sin2,label='sin2')
plt.plot(sin3,label='sin3')
plt.legend(loc=2)
plt.show()

X=np.array([sin1,sin2,sin3])

from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
distances

This gives the distance from each series to its nearest neighbors (the first column is each series' zero distance to itself):

array([[ 0.        , 16.39833107],
       [ 0.        ,  5.2312092 ],
       [ 0.        ,  5.2312092 ]])
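Note that the plain Euclidean distance used above is sensitive to scale, whereas condition 3 in the question asks for series with the same shape but different scales to be treated as similar. One common way to approximate that is to z-normalize each series before computing distances. The sketch below assumes this is acceptable for your data (it also removes any constant offset); `z_normalize` is a hypothetical helper and the data are synthetic.

```python
import numpy as np

def z_normalize(rows, eps=1e-12):
    """Rescale each row to zero mean and unit standard deviation."""
    rows = np.asarray(rows, dtype=float)
    mean = rows.mean(axis=1, keepdims=True)
    std = rows.std(axis=1, keepdims=True)
    # Guard against division by zero for constant series.
    return (rows - mean) / np.maximum(std, eps)

# Synthetic 24-hour profiles (assumed for illustration).
hours = np.arange(24)
small = np.sin(2 * np.pi * hours / 24)
large = 10 * small + 50                  # same shape, different scale and offset
other = np.cos(4 * np.pi * hours / 24)   # different shape

X = np.array([small, large, other])
Z = z_normalize(X)

raw_dist = np.linalg.norm(X[0] - X[1])    # large: dominated by the scale difference
norm_dist = np.linalg.norm(Z[0] - Z[1])   # near zero: shapes match after normalizing
other_dist = np.linalg.norm(Z[0] - Z[2])  # stays large: shapes really differ

print(raw_dist, norm_dist, other_dist)
```

The normalized rows `Z` can then be passed to `NearestNeighbors` in place of `X`. This addresses only the scale condition; it does nothing for shifted series (condition 2).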


Source: https://stackoverflow.com/questions/58358110/clustering-similar-time-series
