I would like to do some DBSCAN on Spark. I have currently found 2 implementations:
You can also consider using Smile, which provides an implementation of DBSCAN. The most direct way is to use Spark's groupByKey
combined with either mapGroups
or flatMapGroups
and run dbscan inside each group. Here's an example:
import smile.clustering._

val dataset: Array[Array[Double]] = Array(
  Array(100, 100),
  Array(101, 100),
  Array(100, 101),
  Array(100, 100),
  Array(101, 100),
  Array(100, 101),
  Array(0, 0),
  Array(1, 0),
  Array(1, 2),
  Array(1, 1)
)

// minPts: minimum number of neighbors for a core point,
// radius: the neighborhood radius within which neighbors are counted
val dbscanResult = dbscan(dataset, minPts = 3, radius = 5)
println(dbscanResult)
// output
DBSCAN clusters of 10 data points:
0      6 (60.0%)
1      4 (40.0%)
Noise  0 ( 0.0%)
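To tie this back to Spark, here's a minimal sketch of the groupByKey + flatMapGroups approach (not my exact code): the Point and Labeled case classes and the group column are made up for the example, and model.y is the per-point cluster label array exposed by Smile's DBSCAN result.

import org.apache.spark.sql.{Dataset, SparkSession}
import smile.clustering._

case class Point(group: String, x: Double, y: Double)
case class Labeled(group: String, x: Double, y: Double, cluster: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// toy data: two spatial groups, clustered independently
val points: Dataset[Point] = Seq(
  Point("a", 100, 100), Point("a", 101, 100), Point("a", 100, 101), Point("a", 100, 100),
  Point("b", 0, 0), Point("b", 1, 0), Point("b", 1, 2), Point("b", 1, 1)
).toDS()

val labeled: Dataset[Labeled] = points
  .groupByKey(_.group)                     // one DBSCAN run per group
  .flatMapGroups { (_, rows) =>
    val pts  = rows.toArray                // materialize the group on one executor
    val data = pts.map(p => Array(p.x, p.y))
    val model = dbscan(data, minPts = 3, radius = 5)
    pts.zip(model.y).map { case (p, c) => Labeled(p.group, p.x, p.y, c) }.toSeq
  }

labeled.show()

Note that each group is materialized in memory on a single executor, so this works as long as any individual group fits on one machine.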
You can also write a User Defined Aggregate Function (UDAF) if you need to eke out more performance.
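For the UDAF route, Spark 3's typed Aggregator is the usual way to write one. Below is a rough sketch that reuses the Point class and imports from the example above; to keep it short it only returns the number of clusters per group (Smile's k field) rather than per-point labels, and all names are illustrative.

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// buffers a group's points, then runs DBSCAN once in finish()
object DbscanClusterCount extends Aggregator[Point, Seq[Array[Double]], Int] {
  def zero: Seq[Array[Double]] = Nil
  def reduce(buf: Seq[Array[Double]], p: Point): Seq[Array[Double]] = Array(p.x, p.y) +: buf
  def merge(a: Seq[Array[Double]], b: Seq[Array[Double]]): Seq[Array[Double]] = a ++ b
  def finish(buf: Seq[Array[Double]]): Int =
    if (buf.size < 3) 0                                  // too few points to form a cluster
    else dbscan(buf.toArray, minPts = 3, radius = 5).k   // k = number of clusters found
  def bufferEncoder: Encoder[Seq[Array[Double]]] = Encoders.kryo[Seq[Array[Double]]]
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

// usage: number of DBSCAN clusters per group
val clustersPerGroup = points.groupByKey(_.group).agg(DbscanClusterCount.toColumn.name("numClusters"))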
I use this approach at work to cluster time-series data: grouping with Spark's time window function and then running DBSCAN within each window lets us parallelize the implementation.
I was inspired to do this by the following article.
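For the time-series case, a rough sketch of the windowed grouping could look like the following; the Event case class and the 10-minute tumbling window are assumptions rather than our actual schema, and it reuses the SparkSession, implicits and Smile import from the earlier sketch.

import java.sql.Timestamp

case class Event(ts: Timestamp, x: Double, y: Double)

val windowMillis = 10 * 60 * 1000L                 // 10-minute tumbling windows
val events: Dataset[Event] = ???                   // your time-series data goes here

val clusteredPerWindow = events
  .groupByKey(e => e.ts.getTime / windowMillis)    // bucket rows by window start
  .flatMapGroups { (windowId, rows) =>
    val pts  = rows.toArray
    val data = pts.map(e => Array(e.x, e.y))
    val model = dbscan(data, minPts = 3, radius = 5)
    pts.zip(model.y).map { case (e, c) => (windowId, e.ts, e.x, e.y, c) }.toSeq
  }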