DBSCAN on Spark: which implementation?

名媛妹妹 2020-12-28 21:09

I would like to run DBSCAN on Spark. So far I have found two implementations:

  • https://github.com/irvingc/dbscan-on-spark
  • https://github.com/alito
4 Answers
  •  自闭症患者
    2020-12-28 21:52

    You can also consider using smile, which provides an implementation of DBSCAN. The most direct way is to use groupBy combined with either mapGroups or flatMapGroups, and run DBSCAN inside each group. Here's an example:

      import smile.clustering._
    
      val dataset: Array[Array[Double]] = Array(
        Array(100, 100),
        Array(101, 100),
        Array(100, 101),
        Array(100, 100),
        Array(101, 100),
        Array(100, 101),
    
        Array(0, 0),
        Array(1, 0),
        Array(1, 2),
        Array(1, 1)
      )
    
      val dbscanResult = dbscan(dataset, minPts = 3, radius = 5)
      println(dbscanResult)
    
      // Output:
      // DBSCAN clusters of 10 data points:
      // 0         6 (60.0%)
      // 1         4 (40.0%)
      // Noise     0 ( 0.0%)
    

    You can also write a User Defined Aggregate Function (UDAF) if you need to eke out more performance.

    I use this approach at work to cluster time-series data: we group with Spark's time window function and then run DBSCAN within each window, which lets us parallelize the clustering across windows.
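To make the groupBy + flatMapGroups wiring concrete, here is a minimal sketch. The `Point`/`Labeled` case classes, the `deviceId` grouping key, and the parameter values are made up for illustration; it assumes Spark 3.x and Smile are on the classpath. Note that each group is collected into a single executor's memory, so every group must be small enough to fit there.

```scala
import org.apache.spark.sql.SparkSession
import smile.clustering.dbscan

// Hypothetical schema: 2-D points keyed by the device that produced them.
case class Point(deviceId: String, x: Double, y: Double)
case class Labeled(deviceId: String, x: Double, y: Double, cluster: Int)

object PerGroupDbscan {
  // Cluster one group locally with Smile; returns a label per point.
  // Smile marks noise points with Integer.MAX_VALUE in the label array.
  def clusterGroup(pts: Array[Point], minPts: Int, radius: Double): Array[Int] = {
    val data = pts.map(p => Array(p.x, p.y))
    dbscan(data, minPts, radius).y
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("per-group-dbscan")
      .getOrCreate()
    import spark.implicits._

    // Same two spatial clusters as the example above, split across two keys.
    val points = Seq(
      Point("a", 100, 100), Point("a", 101, 100), Point("a", 100, 101),
      Point("a", 100, 100), Point("a", 101, 100), Point("a", 100, 101),
      Point("b", 0, 0), Point("b", 1, 0), Point("b", 1, 2), Point("b", 1, 1)
    ).toDS()

    // flatMapGroups gathers each key's rows on one executor and clusters
    // them there; distinct keys are processed in parallel across the cluster.
    val labeled = points
      .groupByKey(_.deviceId)
      .flatMapGroups { (_, rows) =>
        val pts = rows.toArray
        pts.zip(clusterGroup(pts, minPts = 3, radius = 5.0))
          .map { case (p, c) => Labeled(p.deviceId, p.x, p.y, c) }
      }

    labeled.show()
    spark.stop()
  }
}
```

This keeps all of Spark's scheduling and fault tolerance while the actual clustering stays a plain single-machine library call, which is the trade-off: it parallelizes across groups, not within one group.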

    I was inspired by the following article to do this.
