I would like to do some DBSCAN on Spark. I have currently found 2 implementations:
You can also consider using smile, which provides an implementation of DBSCAN. The most direct way is to use groupByKey combined with either mapGroups or flatMapGroups (both live on the KeyValueGroupedDataset that groupByKey returns) and run DBSCAN inside each group; a sketch of that wiring follows the smile example below. Here's an example of the smile call itself:
import smile.clustering._
val dataset: Array[Array[Double]] = Array(
  Array(100.0, 100.0),
  Array(101.0, 100.0),
  Array(100.0, 101.0),
  Array(100.0, 100.0),
  Array(101.0, 100.0),
  Array(100.0, 101.0),
  Array(0.0, 0.0),
  Array(1.0, 0.0),
  Array(1.0, 2.0),
  Array(1.0, 1.0)
)
val dbscanResult = dbscan(dataset, minPts = 3, radius = 5) // minPts: density threshold, radius: the ε neighbourhood distance
println(dbscanResult)
// output
DBSCAN clusters of 10 data points:
0 6 (60.0%)
1 4 (40.0%)
Noise 0 ( 0.0%)
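
To make the Spark side concrete, here is a minimal sketch of the groupByKey + flatMapGroups wiring. This is my own illustration, not code from the original answer: the key column, case classes and parameter values are hypothetical, and it assumes smile 2.x, where the fitted model exposes the per-point cluster labels as y.

import org.apache.spark.sql.{Dataset, SparkSession}
import smile.clustering.dbscan

case class Point(key: String, x: Double, y: Double)
case class ClusteredPoint(key: String, x: Double, y: Double, cluster: Int)

val spark = SparkSession.builder().appName("dbscan-per-group").getOrCreate()
import spark.implicits._

// Hypothetical keyed input; in practice this would come from your actual source.
val points: Dataset[Point] = Seq(
  Point("a", 100.0, 100.0), Point("a", 101.0, 100.0), Point("a", 100.0, 101.0),
  Point("b", 0.0, 0.0), Point("b", 1.0, 0.0), Point("b", 1.0, 2.0), Point("b", 1.0, 1.0)
).toDS()

val clustered: Dataset[ClusteredPoint] = points
  .groupByKey(_.key)
  .flatMapGroups { (key, iter) =>
    val pts = iter.toArray                      // materialize the group in one task
    val data = pts.map(p => Array(p.x, p.y))    // smile expects Array[Array[Double]]
    val model = dbscan(data, minPts = 3, radius = 5)
    pts.zip(model.y).map { case (p, label) =>   // model.y: per-point cluster labels
      ClusteredPoint(key, p.x, p.y, label)
    }
  }

Note that each group is collected into memory within a single task, so this only works when every key's points fit comfortably on one executor.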
You can also write a User Defined Aggregate Function (UDAF) if you need to eke out more performance.
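
For completeness, here is one way the UDAF idea could look with Spark's typed Aggregator. This is my own rough sketch rather than code from the answer: it reuses the hypothetical Point case class from the sketch above, and the DBSCAN parameters are hardcoded placeholders.

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import smile.clustering.dbscan

// Buffers a group's points and runs DBSCAN once per group in finish().
object DbscanAggregator extends Aggregator[Point, Seq[Point], Seq[Int]] {
  def zero: Seq[Point] = Seq.empty
  def reduce(buf: Seq[Point], p: Point): Seq[Point] = buf :+ p
  def merge(b1: Seq[Point], b2: Seq[Point]): Seq[Point] = b1 ++ b2
  def finish(buf: Seq[Point]): Seq[Int] = {
    val data = buf.map(p => Array(p.x, p.y)).toArray
    dbscan(data, minPts = 3, radius = 5).y.toSeq    // labels follow the buffer order
  }
  // ExpressionEncoder derives encoders for the Seq buffer and output types
  def bufferEncoder: Encoder[Seq[Point]] = ExpressionEncoder()
  def outputEncoder: Encoder[Seq[Int]] = ExpressionEncoder()
}

// Usage on the keyed Dataset from the previous sketch:
// points.groupByKey(_.key).agg(DbscanAggregator.toColumn.name("clusterLabels"))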
I use this grouping approach at work to cluster time-series data: we group with Spark's time window function and then run DBSCAN within each window, which lets us parallelize the work across windows.
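
As a hedged illustration of that windowed variant (again my own sketch, reusing the SparkSession, implicits and dbscan import from the first sketch): the timestamp column, the 10-minute bucket size, and the DBSCAN parameters are all assumptions, and instead of the built-in window() function it keys by a truncated timestamp, which plugs directly into groupByKey.

import java.sql.Timestamp

case class TimedPoint(ts: Timestamp, x: Double, y: Double)

val windowMillis = 10 * 60 * 1000L                                      // 10-minute buckets (placeholder size)
val timedPoints: Dataset[TimedPoint] = spark.emptyDataset[TimedPoint]   // stand-in for the real input

val clusteredByWindow = timedPoints
  .groupByKey(p => p.ts.getTime / windowMillis)             // key each point by its time bucket
  .flatMapGroups { (bucket, iter) =>
    val pts = iter.toArray
    val labels = dbscan(pts.map(p => Array(p.x, p.y)), minPts = 3, radius = 5).y
    pts.zip(labels).map { case (p, label) => (bucket, p.x, p.y, label) }
  }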
I was inspired by the following article to do this.