Use sklearn DBSCAN model to classify new entries

Question

I have a huge "dynamic" dataset and I'm trying to find interesting clusters on it.

After running a lot of different unsupervised clustering algorithms I have found a configuration of DBSCAN which gives coherent results.

I would like to extrapolate the model that DBSCAN builds from my test data and apply it to other datasets without re-running the algorithm. I cannot run the algorithm over the whole dataset because it would run out of memory, and since the data is dynamic, the model might no longer make sense at a later time.

Using sklearn, I have found that other clustering algorithms - like MiniBatchKMeans - have a predict method, but DBSCAN does not.

I understand that for MiniBatchKMeans the centroids uniquely define the model. But such a thing might not exist for DBSCAN.
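To make the difference concrete, here is a minimal sketch on synthetic data (the blobs and the eps/min_samples values are assumptions for illustration, not the asker's configuration). It checks which estimator exposes predict and shows what each fitted model stores: centroids for MiniBatchKMeans, core samples and per-point labels for DBSCAN.

```python
from sklearn.cluster import DBSCAN, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real data (assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.4, random_state=0)

kmeans = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0).fit(X)
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)

# MiniBatchKMeans can label unseen points; DBSCAN only offers fit_predict.
has_predict = {
    "MiniBatchKMeans": hasattr(kmeans, "predict"),
    "DBSCAN": hasattr(dbscan, "predict"),
}
print(has_predict)

# What each fitted model keeps around:
print(kmeans.cluster_centers_.shape)  # one centroid per cluster
print(dbscan.components_.shape)       # one row per core sample
print(dbscan.labels_.shape)           # one label per training point, -1 = noise
```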

So my question is: what is the proper way to extrapolate the DBSCAN model? Should I train a supervised learning algorithm on the output that DBSCAN gave for my test dataset? Or is there something intrinsic to the DBSCAN model that can be used to classify new data without re-running the algorithm?

Answer 1:

Train a classifier based on your model.

DBSCAN is not easy to adapt to new objects, because you would eventually need to adjust minPts. Adding points to DBSCAN can cause clusters to merge, which you probably do not want to happen.

If you consider the clusters found by DBSCAN to be useful, train a classifier to put new instances into the same classes. You now want to perform classification, not rediscover structure.
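A minimal sketch of that idea, under stated assumptions: the data is synthetic, the eps/min_samples values are placeholders, and KNeighborsClassifier is just one possible choice of classifier (the answer does not name one). DBSCAN runs once on a sample that fits in memory, and its labels become the training targets.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Sample that fits in memory (synthetic; an assumption for illustration).
centers = [[0, 0], [5, 5], [0, 5]]
X_sample, _ = make_blobs(n_samples=300, centers=centers,
                         cluster_std=0.4, random_state=0)

# Run DBSCAN once and keep its cluster labels.
db = DBSCAN(eps=0.5, min_samples=5).fit(X_sample)
labels = db.labels_

# Drop noise points (label -1) so the classifier only learns real clusters.
mask = labels != -1
clf = KNeighborsClassifier(n_neighbors=5).fit(X_sample[mask], labels[mask])

# New entries are classified without re-running DBSCAN.
X_new = np.array([[0.1, -0.1], [5.2, 4.9], [-0.2, 5.1]])
predicted = clf.predict(X_new)
print(predicted)
```

Note that a classifier trained this way always returns one of the known clusters. If new points can be genuine outliers, you would need an extra check (for example a distance threshold) to flag them as noise.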

Answer 2:

DBSCAN and other 'unsupervised' clustering methods can be used to automatically propagate labels used by classifiers (a 'supervised' machine learning task) in what is known as 'semi-supervised' machine learning. I'll break down the general steps for doing this and cite a series of semi-supervised papers that motivated this approach.

1. By some means, label a small portion of your data.
2. Use DBSCAN or another clustering method (e.g. k-means) to cluster your labeled and unlabeled data together.
3. For each cluster, determine the most common label (if any) among its members, and re-label all members of the cluster with that label. This effectively increases the amount of labeled training data.
4. Train a supervised classifier using the dataset from step 3.
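The four steps above can be sketched roughly as follows. Everything specific here is an assumption for illustration: the synthetic blobs, the 10% labeled fraction, the eps/min_samples values, and the use of KNeighborsClassifier as the final classifier.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

centers = [[0, 0], [5, 5], [0, 5]]
X, y_true = make_blobs(n_samples=300, centers=centers,
                       cluster_std=0.4, random_state=0)

# Step 1: pretend only 10% of the points were labeled by hand (-1 = unlabeled).
rng = np.random.default_rng(0)
y_partial = np.full(len(X), -1)
labeled_idx = rng.choice(len(X), size=30, replace=False)
y_partial[labeled_idx] = y_true[labeled_idx]

# Step 2: cluster labeled and unlabeled points together.
cluster_ids = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Step 3: give every member of a cluster that cluster's most common known label.
y_propagated = y_partial.copy()
for cid in np.unique(cluster_ids):
    if cid == -1:  # DBSCAN noise is not a cluster; skip it
        continue
    members = cluster_ids == cid
    known = y_partial[members & (y_partial != -1)]
    if known.size:  # clusters without any labeled member stay unlabeled
        y_propagated[members] = Counter(known.tolist()).most_common(1)[0][0]

# Step 4: train a supervised classifier on the enlarged labeled set.
train = y_propagated != -1
clf = KNeighborsClassifier(n_neighbors=5).fit(X[train], y_propagated[train])
print(int(train.sum()), "labeled points after propagation, up from", len(labeled_idx))
```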

The following papers propose some extensions to this general process to improve classification performance. As a note, all of the following papers have found that k-means is a consistent, efficient, and effective clustering method for semi-supervised learning compared to about a dozen other clustering methods. They then use k-nearest neighbors with a large K value for classification. One paper that specifically covered DBSCAN-based clustering is:
- Erman, J., & Arlitt, M. (2006). Traffic classification using clustering algorithms. In Proceedings of the 2006 SIGCOMM workshop on Mining network data (pp. 281–286). https://doi.org/10.1145/1162678.1162679

NOTE: These papers are listed in chronological order and build upon each other. The 2016 Glennan paper is what you should read if you only want to see the most successful/advanced iteration.

Erman, J., & Arlitt, M. (2006). Traffic classification using clustering algorithms. In Proceedings of the 2006 SIGCOMM workshop on Mining network data (pp. 281–286). https://doi.org/10.1145/1162678.1162679
Wang, Y., Xiang, Y., Zhang, J., & Yu, S. (2011). A novel semi-supervised approach for network traffic clustering. In 5th International Conference on Network and System Security (NSS) (pp. 169–175). Milan, Italy: IEEE. https://doi.org/10.1109/ICNSS.2011.6059997
Zhang, J., Chen, C., Xiang, Y., & Zhou, W. (2012). Semi-supervised and compound classification of network traffic. In Proceedings - 32nd IEEE International Conference on Distributed Computing Systems Workshops, ICDCSW 2012 (pp. 617–621). https://doi.org/10.1109/ICDCSW.2012.12
Glennan, T., Leckie, C., & Erfani, S. M. (2016). Improved Classification of Known and Unknown Network Traffic Flows Using Semi-supervised Machine Learning. In J. K. Liu & R. Steinfeld (Eds.), Information Security and Privacy: 21st Australasian Conference (Vol. 2, pp. 493–501). Melbourne: Springer International Publishing. https://doi.org/10.1007/978-3-319-40367-0_33

Source: https://stackoverflow.com/questions/29625550/use-sklearn-dbscan-model-to-classify-new-entries

Tags

machine-learning

scikit-learn

classification

cluster-analysis

dbscan