How can I perform K-means clustering on time series data?

Submitted by 烂漫一生 on 2019-11-29 23:09:35

Time series are usually high-dimensional, and you need a specialized distance function to compare them for similarity. There may also be outliers.

k-means is designed for low-dimensional spaces with a meaningful Euclidean distance. It is not very robust to outliers, because it puts squared weight on them.

Using k-means on time series data doesn't sound like a good idea to me. Try looking into more modern, robust clustering algorithms. Many will allow you to use arbitrary distance functions, including time series distances such as DTW (dynamic time warping).
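To illustrate why a time series distance can matter, here is a minimal sketch of textbook DTW next to plain Euclidean distance. The function names `euclidean` and `dtw_distance` and the toy series are assumptions made for this example; they do not come from any particular library.

```python
import math


def euclidean(a, b):
    """Point-by-point Euclidean distance; both series must have equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Textbook dynamic time warping with an absolute-difference local cost.

    cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    Runs in O(len(a) * len(b)) time and memory.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also matches the previous b value
                cost[i][j - 1],      # b[j-1] also matches the previous a value
                cost[i - 1][j - 1],  # both series advance together
            )
    return cost[n][m]


# The same bump, shifted by one time step (toy data for illustration).
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
print(euclidean(a, b))     # penalizes the shift
print(dtw_distance(a, b))  # warps the time axis, so the shift costs nothing
```

Euclidean distance treats the shifted bump as a real difference, while DTW aligns the two shapes. Real applications usually add a warping-window constraint, which this sketch leaves out.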

It's probably too late for an answer, but:

The methods above use R. You'll find more methods by looking, e.g., for "Iterative Incremental Clustering of Time Series".

I have recently come across the kml R package, which claims to implement k-means clustering for longitudinal data. I have not tried it out myself.

Also, the paper "Time-series clustering - A decade review" by S. Aghabozorgi, A. S. Shirkhorshidi and T. Ying Wah might help you find alternatives. Another nice, although somewhat dated, paper is "Clustering of time series data-a survey" by T. Warren Liao.

If you really did want to use clustering, then depending on your application you could generate a low-dimensional feature vector for each time series, for example the mean, the standard deviation, and the dominant frequency from a Fourier transform. This would be suitable for use with k-means, but whether it gives you useful results depends on your specific application and the content of your time series.
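A rough sketch of that feature-vector idea, using only the Python standard library. Everything here is an assumption made for illustration: the helper names, the naive DFT, the z-score scaling, the deterministic farthest-point initialization (chosen so the toy run is reproducible), and the synthetic sine series.

```python
import cmath
import math
import statistics


def dominant_frequency(series):
    """Index of the non-zero DFT bin with the largest magnitude (naive O(n^2) DFT)."""
    n = len(series)
    mean = sum(series) / n
    best_k, best_mag = 0, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum((x - mean) * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(series))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k


def features(series):
    """Three summary features: mean, standard deviation, dominant frequency bin."""
    return [statistics.mean(series), statistics.pstdev(series),
            float(dominant_frequency(series))]


def standardize(rows):
    """Z-score each feature column so that no single feature dominates the distance."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    stds = [s if s > 1e-12 else 1.0 for s in stds]  # leave constant columns unscaled
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def sine(offset, amplitude, cycles, n=64):
    """Synthetic series: `cycles` full sine periods over n samples."""
    return [offset + amplitude * math.sin(2 * math.pi * cycles * t / n)
            for t in range(n)]


# Two slow, low-level series and two fast, high-level series (toy data).
data = [sine(0.0, 1.0, 2), sine(0.2, 1.1, 2), sine(3.0, 2.0, 8), sine(3.2, 2.1, 8)]
labels, _ = kmeans(standardize([features(s) for s in data]), k=2)
print(labels)
```

The scaling step matters: after standardization every feature contributes equally, so the clusters reflect whichever features you chose to include, which is exactly the application-dependent part.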

I don't think k-means is the right approach either. As @Anony-Mousse suggested, you can use DTW. In fact, I had the same problem in one of my projects and wrote my own Python class for it. The logic is:

  1. Create all cluster combinations, where k is the cluster count and n is the number of series. The number of combinations returned should be n! / (k! (n-k)!). Each combination is a candidate set of centers.
  2. For each series, calculate the distance to each center in each cluster group and assign the series to the nearest center.
  3. For each cluster group, calculate the total within-cluster distance.
  4. Choose the group with the minimum total.

And, the Python implementation is here if you're interested.
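For readers who cannot reach that implementation, the four steps can be sketched as below. This is one reading of the steps, not the answer author's code: the names `dtw` and `exhaustive_medoids`, the choice of actual series as centers, and the toy data are all assumptions.

```python
from itertools import combinations


def dtw(a, b):
    """Dynamic time warping distance, keeping only two rows of the cost table."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            curr.append(abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


def exhaustive_medoids(series, k, distance=dtw):
    """Try every k-subset of the series as centers and keep the cheapest one.

    Returns (total_distance, center_indices, labels), where labels[i] is the
    index of the center that series i is assigned to.
    """
    n = len(series)
    # Pairwise distances are computed once and reused for every candidate group.
    dist = [[distance(series[i], series[j]) for j in range(n)] for i in range(n)]
    best = None
    for centers in combinations(range(n), k):                        # step 1
        labels = [min(centers, key=lambda c: dist[i][c])             # step 2
                  for i in range(n)]
        total = sum(dist[i][c] for i, c in enumerate(labels))        # step 3
        if best is None or total < best[0]:                          # step 4
            best = (total, centers, labels)
    return best


# Two rising series and two flat series of unequal length (toy data).
series = [[0, 1, 2, 3], [0, 0, 1, 2, 3], [5, 5, 5, 5], [5, 5, 6, 5]]
total, centers, labels = exhaustive_medoids(series, k=2)
print(total, centers, labels)
```

Because every k-subset is tried, the number of candidate groups grows as n! / (k! (n-k)!), so this approach is only practical for a small number of series.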
