Customize Distance Formula of K-means in Apache Spark Python

Submitted by 独自空忆成欢 on 2019-11-27 07:19:36

Question


I'm currently using K-means for clustering, following this tutorial and API.

However, I want to use a custom formula for calculating distances. How can I pass a custom distance function to k-means in PySpark?


Answer 1:


In general, using a different distance measure doesn't make sense, because the k-means algorithm (unlike k-medoids) is well defined only for Euclidean distances.

See "Why does k-means clustering algorithm use only Euclidean distance metric?" for an explanation.
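The core of that explanation can be illustrated in plain Python (a minimal sketch of my own, not from the linked answer): the update step of k-means recomputes each center as the mean of its assigned points, and the mean is the minimizer of the sum of *squared Euclidean* distances. Under another metric, such as Manhattan distance, a different statistic (the median) is the minimizer, so swapping in a custom distance without also changing the update step breaks the algorithm's convergence guarantee.

```python
# 1-D toy data with an outlier, to compare the mean and the median
# as candidate cluster "centers" under two different cost functions.
points = [1.0, 2.0, 3.0, 4.0, 100.0]

mean = sum(points) / len(points)           # 22.0
median = sorted(points)[len(points) // 2]  # 3.0

def sq_euclidean_cost(center):
    # Cost that k-means actually minimizes in its update step.
    return sum((p - center) ** 2 for p in points)

def manhattan_cost(center):
    # A different metric: sum of absolute distances.
    return sum(abs(p - center) for p in points)

# The mean minimizes the squared-Euclidean cost ...
assert sq_euclidean_cost(mean) < sq_euclidean_cost(median)
# ... but under Manhattan distance the median is cheaper, so the
# mean-based update step is no longer optimal for that metric.
assert manhattan_cost(median) < manhattan_cost(mean)
```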

Moreover, MLlib algorithms are implemented in Scala, and PySpark provides only the wrappers required to execute the Scala code. Therefore, providing a custom metric as a Python function wouldn't be technically possible without significant changes to the API.

Please note that since Spark 2.4 there are two built-in distance measures that can be used with pyspark.ml.clustering.KMeans and pyspark.ml.clustering.BisectingKMeans (see the distanceMeasure Param):

  • euclidean for Euclidean distance.
  • cosine for cosine distance.

Use at your own risk.



Source: https://stackoverflow.com/questions/34527287/customize-distance-formular-of-k-means-in-apache-spark-python
