I would like to do some DBSCAN on Spark. I have currently found 2 implementations:
You can also consider using Smile, which provides an implementation of DBSCAN. The most direct way is to use Spark's groupByKey
combined with either mapGroups
or flatMapGroups
and run dbscan inside each group. Here's an example:
import smile.clustering._

val dataset: Array[Array[Double]] = Array(
  Array(100, 100),
  Array(101, 100),
  Array(100, 101),
  Array(100, 100),
  Array(101, 100),
  Array(100, 101),
  Array(0, 0),
  Array(1, 0),
  Array(1, 2),
  Array(1, 1)
)

// minPts: minimum number of neighbors for a core point,
// radius: the neighborhood radius within which neighbors are counted
val dbscanResult = dbscan(dataset, minPts = 3, radius = 5)
println(dbscanResult)
// output
DBSCAN clusters of 10 data points:
0      6 (60.0%)
1      4 (40.0%)
Noise  0 ( 0.0%)
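To tie this back to Spark, here's a minimal sketch of the groupByKey + flatMapGroups approach (not my exact code): the Point and Labeled case classes and the group column are made up for the example, and model.y is the per-point cluster label array exposed by Smile's DBSCAN result.

import org.apache.spark.sql.{Dataset, SparkSession}
import smile.clustering._

case class Point(group: String, x: Double, y: Double)
case class Labeled(group: String, x: Double, y: Double, cluster: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// toy data: two spatial groups, clustered independently
val points: Dataset[Point] = Seq(
  Point("a", 100, 100), Point("a", 101, 100), Point("a", 100, 101), Point("a", 100, 100),
  Point("b", 0, 0), Point("b", 1, 0), Point("b", 1, 2), Point("b", 1, 1)
).toDS()

val labeled: Dataset[Labeled] = points
  .groupByKey(_.group)                     // one DBSCAN run per group
  .flatMapGroups { (_, rows) =>
    val pts  = rows.toArray                // materialize the group on one executor
    val data = pts.map(p => Array(p.x, p.y))
    val model = dbscan(data, minPts = 3, radius = 5)
    pts.zip(model.y).map { case (p, c) => Labeled(p.group, p.x, p.y, c) }.toSeq
  }

labeled.show()

Note that each group is materialized in memory on a single executor, so this works as long as any individual group fits on one machine.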
You can also write a User Defined Aggregate Function (UDAF) if you need to eke out more performance.
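For the UDAF route, Spark 3's typed Aggregator is the usual way to write one. Below is a rough sketch that reuses the Point class and imports from the example above; to keep it short it only returns the number of clusters per group (Smile's k field) rather than per-point labels, and all names are illustrative.

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// buffers a group's points, then runs DBSCAN once in finish()
object DbscanClusterCount extends Aggregator[Point, Seq[Array[Double]], Int] {
  def zero: Seq[Array[Double]] = Nil
  def reduce(buf: Seq[Array[Double]], p: Point): Seq[Array[Double]] = Array(p.x, p.y) +: buf
  def merge(a: Seq[Array[Double]], b: Seq[Array[Double]]): Seq[Array[Double]] = a ++ b
  def finish(buf: Seq[Array[Double]]): Int =
    if (buf.size < 3) 0                                  // too few points to form a cluster
    else dbscan(buf.toArray, minPts = 3, radius = 5).k   // k = number of clusters found
  def bufferEncoder: Encoder[Seq[Array[Double]]] = Encoders.kryo[Seq[Array[Double]]]
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

// usage: number of DBSCAN clusters per group
val clustersPerGroup = points.groupByKey(_.group).agg(DbscanClusterCount.toColumn.name("numClusters"))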
I use this approach at work to cluster time-series data: grouping with Spark's time window function and then running DBSCAN within each window lets us parallelize the implementation.
I was inspired to do this by the following article.
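For the time-series case, a rough sketch of the windowed grouping could look like the following; the Event case class and the 10-minute tumbling window are assumptions rather than our actual schema, and it reuses the SparkSession, implicits and Smile import from the earlier sketch.

import java.sql.Timestamp

case class Event(ts: Timestamp, x: Double, y: Double)

val windowMillis = 10 * 60 * 1000L                 // 10-minute tumbling windows
val events: Dataset[Event] = ???                   // your time-series data goes here

val clusteredPerWindow = events
  .groupByKey(e => e.ts.getTime / windowMillis)    // bucket rows by window start
  .flatMapGroups { (windowId, rows) =>
    val pts  = rows.toArray
    val data = pts.map(e => Array(e.x, e.y))
    val model = dbscan(data, minPts = 3, radius = 5)
    pts.zip(model.y).map { case (e, c) => (windowId, e.ts, e.x, e.y, c) }.toSeq
  }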