Jaccard Similarity of an RDD with the help of Spark and Scala without Cartesian?


As the Cartesian product is an expensive operation on an RDD, I tried to solve the above problem using the HashingTF and MinHashLSH classes in Spark MLlib to find the Jaccard similarity. Steps to find the Jaccard similarity on the RDD "a" mentioned in the question:
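
For illustration, here is a minimal sketch of what the input is assumed to look like: a pair RDD of an id and an array of string tokens, which is the shape that both toDF("id", "values") and HashingTF below expect (the SparkSession name and sample values are made up for this example).

    import org.apache.spark.sql.SparkSession

    // Hypothetical session and input RDD "a": (id, array of tokens)
    val sparkSession = SparkSession.builder().appName("JaccardWithLSH").getOrCreate()
    val a = sparkSession.sparkContext.parallelize(Seq(
      (1, Array("apple", "banana", "cherry")),
      (2, Array("banana", "cherry", "date")),
      (3, Array("apple", "date"))
    ))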

  • Convert the RDD into a DataFrame

     import sparkSession.implicits._
     // Convert the (id, values) pair RDD into a DataFrame with named columns
     val dfA = a.toDF("id", "values")
    
  • Create the feature vector with the help of HashingTF

      val hashingTF = new HashingTF()
        .setInputCol("values")
        .setOutputCol("features")
        .setNumFeatures(1048576)
    
  • Feature transformation

    val featurizedData = hashingTF.transform(dfA) //Feature Transformation  
    
  • Create the MinHash model. The more hash tables you use, the more accurate the results, but the higher the communication cost and run time.

     val mh = new MinHashLSH()
            .setNumHashTables(3) 
            .setInputCol("features")
            .setOutputCol("hashes")
    
  • Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.

      val model = mh.fit(featurizedData)
      // Approximately self-join featurizedData on Jaccard distance smaller than 0.45
      val dffilter = model.approxSimilarityJoin(featurizedData, featurizedData, 0.45)
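
    The join output exposes the matched rows as datasetA and datasetB plus the Jaccard distance in distCol (the default column names produced by approxSimilarityJoin). As a rough sketch, the distance can be turned into a similarity and the trivial self-pairs dropped like this:

      import org.apache.spark.sql.functions.{col, lit}

      // Jaccard similarity = 1 - Jaccard distance; drop the (id, id) pairs produced by the self-join
      val similarities = dffilter
        .filter(col("datasetA.id") =!= col("datasetB.id"))
        .select(
          col("datasetA.id").alias("idA"),
          col("datasetB.id").alias("idB"),
          (lit(1.0) - col("distCol")).alias("jaccardSimilarity"))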
    

Since in Spark we have to do manual optimization in our code, such as setting the number of partitions and the persistence level, I configured these parameters as well.

  • Changing the storage level from persist() to persist(StorageLevel.MEMORY_AND_DISK) helped me get rid of the OOM error.
  • While doing the join operation, I also re-partitioned the data according to the RDD size. On the 16.6 GB data set, a simple join operation with 200 partitions ran into OOM; increasing the number of partitions to 600 solved that problem as well (a sketch of both tweaks follows this list).
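
A rough sketch of those two tweaks (the storage level and the 600-partition figure are just the values that worked for my 16.6 GB data set, not general recommendations):

    import org.apache.spark.storage.StorageLevel

    // Cache with spill-to-disk instead of memory-only caching to avoid OOM
    featurizedData.persist(StorageLevel.MEMORY_AND_DISK)

    // Re-partition before the expensive self-join; 600 partitions worked for 16.6 GB
    val repartitioned = featurizedData.repartition(600)
    val dffilterTuned = model.approxSimilarityJoin(repartitioned, repartitioned, 0.45)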

PS: The constant parameters setNumFeatures(1048576) and setNumHashTables(3) were tuned while experimenting on the 16.6 GB data set. You can increase or decrease these values according to your data set. The number of partitions also depends on your data set size. With these optimizations, I got the desired results.

Useful links:
  • https://spark.apache.org/docs/2.2.0/ml-features.html#locality-sensitive-hashing
  • https://eng.uber.com/lsh/
  • https://data-flair.training/blogs/limitations-of-apache-spark/
