Hierarchical Agglomerative Clustering in Spark

Gabe Church

The Bisecting K-Means Approach

This approach seems to do a decent job and runs quite fast. Here is sample code I wrote for using the Bisecting K-Means algorithm in Spark (Scala) to get cluster centers from the Iris data set (which many people are familiar with). Note: I use Spark-Notebook for most of my Spark work; it is very similar to Jupyter Notebooks. I bring this up because you will need to create a Spark SQLContext for this example to work, which may differ based on where or how you are accessing Spark.
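For reference, on Spark 2.x and later the usual entry point is a SparkSession rather than a raw SQLContext. A minimal sketch (the app name "IrisClustering" is just an illustrative choice):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; the app name here is arbitrary
val spark = SparkSession.builder().appName("IrisClustering").getOrCreate()
// A SQLContext is still available from the session if the code below expects one
val sqlContext = spark.sqlContext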

You can download the Iris.csv to test here

You can download Spark-Notebook here

It is a great tool that easily lets you run a standalone Spark cluster. If you want help with it on Linux or Mac, I can provide instructions. Once you download it, you need to use SBT to compile it. From the base directory, run the following two commands:

sbt
run

It will be accessible at localhost:9000

Required Imports

import org.apache.spark.sql.types._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.BisectingKMeans

Method to create sqlContext in Spark-Notebook

import org.apache.spark.sql.SQLContext
// sc is the SparkContext that Spark-Notebook provides automatically
val sqlContext = new SQLContext(sc)

Defining Import Schema

val customSchema = StructType(Array(
  StructField("c0", IntegerType, true), // row index column from the CSV
  StructField("Sepal_Length", DoubleType, true),
  StructField("Sepal_Width", DoubleType, true),
  StructField("Petal_Length", DoubleType, true),
  StructField("Petal_Width", DoubleType, true),
  StructField("Species", StringType, true)))

Making the DF

val iris_df = sqlContext.read
.format("csv")
.option("header", "true") //reading the headers
.option("mode", "DROPMALFORMED")
.schema(customSchema)
.load("/your/path/to/iris.csv")
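To sanity-check the load, you can print the schema and a few rows:

// Verify that the schema and the first rows look as expected
iris_df.printSchema()
iris_df.show(5)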

Specifying features

val assembler = new VectorAssembler()
  // c0 is just a row index, so it is excluded from the features
  .setInputCols(Array("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"))
  .setOutputCol("features")
val iris_df_trans = assembler.transform(iris_df)
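You can peek at the assembled feature vectors to confirm the columns were combined correctly:

// Show a few assembled feature vectors alongside the label column
iris_df_trans.select("Species", "features").show(3, truncate = false)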

Model with 3 Clusters (change with .setK)

val bkm = new BisectingKMeans().setK(3).setSeed(1L).setFeaturesCol("features")
val model = bkm.fit(iris_df_trans)
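The fitted model can also assign each row to a cluster; BisectingKMeans writes the assignment to a column named "prediction" by default:

// Assign a cluster index to every row
val predictions = model.transform(iris_df_trans)
predictions.select("Species", "prediction").show(10)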

Computing cost

val cost = model.computeCost(iris_df_trans)
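Note that computeCost was deprecated in later Spark releases. If you are on Spark 2.3 or newer, the ClusteringEvaluator (silhouette score) is the more future-proof way to judge cluster quality; a sketch using the predictions DataFrame from above:

import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Silhouette ranges from -1 to 1; values close to 1 indicate well-separated clusters
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")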

Calculating Centers

println(s"Within Set Sum of Squared Errors = $cost")
println("Cluster Centers: ")
val centers = model.clusterCenters
centers.foreach(println)

An Agglomerative Approach

The following is an agglomerative hierarchical clustering implementation for Spark that is worth a look for those who are curious. It is not included in base MLlib the way the bisecting k-means method is, and I do not have an example of using it, but a generic sketch of the agglomerative idea follows the links below.

Github Project

Youtube of Presentation at Spark-Summit

Slides from Spark-Summit
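To show only the underlying idea (this is a minimal single-machine sketch with single linkage and a fixed target of k clusters, not the linked Spark implementation):

// A minimal, single-machine sketch of single-linkage agglomerative clustering.
// Every point starts as its own cluster; the two closest clusters merge
// repeatedly until only k clusters remain. Illustrative only, not distributed.
object AgglomerativeSketch {
  type Point = Array[Double]

  def euclidean(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Single linkage: cluster distance = closest pair of points across the two clusters
  def linkage(c1: List[Point], c2: List[Point]): Double =
    (for (p <- c1; q <- c2) yield euclidean(p, q)).min

  def cluster(points: List[Point], k: Int): List[List[Point]] = {
    var clusters: List[List[Point]] = points.map(p => List(p))
    while (clusters.size > k) {
      // Find the pair of clusters with the smallest linkage distance
      val pairs = for {
        i <- clusters.indices
        j <- clusters.indices if i < j
      } yield (i, j, linkage(clusters(i), clusters(j)))
      val (i, j, _) = pairs.minBy(_._3)
      // Merge the closest pair and rebuild the cluster list
      val merged = clusters(i) ++ clusters(j)
      clusters = merged :: clusters.zipWithIndex
        .collect { case (c, idx) if idx != i && idx != j => c }
    }
    clusters
  }
}

For example, AgglomerativeSketch.cluster(List(Array(1.0, 1.0), Array(1.1, 0.9), Array(5.0, 5.0)), 2) merges the two nearby points into one cluster and leaves the distant point in its own.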

The only thing I was able to find is divisive hierarchical clustering, implemented in Spark MLlib via bisecting k-means (here: https://spark.apache.org/docs/latest/mllib-clustering.html#bisecting-k-means). I am planning to give it a try.
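For reference, the RDD-based API from that page looks roughly like this (a sketch; the input path is a placeholder for a whitespace-separated numeric text file):

import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line of space-separated numbers into an mllib Vector
val data = sc.textFile("/your/path/to/data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Fit bisecting k-means with 3 clusters using the RDD-based API
val bkmModel = new BisectingKMeans().setK(3).run(data)

println(s"Compute Cost: ${bkmModel.computeCost(data)}")
bkmModel.clusterCenters.zipWithIndex.foreach { case (center, idx) =>
  println(s"Cluster Center $idx: $center")
}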

Have you found/tried anything?
