Spark LSH gets stuck forever at the approxSimilarityJoin() function

Submitted by 老子叫甜甜 on 2019-12-09 02:16:26

It will finish if you leave it running long enough, but there are some things you can do to speed it up. Reviewing the source code, you can see that the algorithm:

  1. hashes the inputs,
  2. joins the two datasets on the hashes,
  3. computes the Jaccard distance using a UDF, and
  4. filters the result by your threshold.

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala
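
To make those four steps concrete, here is a minimal pure-Python sketch of the same idea. It is not Spark's code: the hash family, the PRIME constant, and every function name are made up for illustration, and each input is assumed to be a non-empty set of integer feature indices.

```python
import random

PRIME = 2038074743  # arbitrary large prime, chosen for this sketch only


def make_hashers(num_tables, seed=42):
    """Draw one random (a, b) pair per hash table."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_tables)]


def min_hashes(items, hashers):
    """Step 1: one MinHash value per hash table for a non-empty set of ints."""
    return [min((a * (x + 1) + b) % PRIME for x in items) for a, b in hashers]


def jaccard_distance(s, t):
    """1 - |intersection| / |union|."""
    return 1.0 - len(s & t) / len(s | t)


def approx_similarity_join(left, right, threshold, num_tables=5):
    hashers = make_hashers(num_tables)

    # Step 1: hash the right-hand side into (table index, hash value) buckets.
    buckets = {}
    for rkey, items in right.items():
        for table, h in enumerate(min_hashes(items, hashers)):
            buckets.setdefault((table, h), set()).add(rkey)

    # Step 2: "join" on the hashes. Two rows become a candidate pair
    # when they share a hash value in at least one table.
    candidates = set()
    for lkey, items in left.items():
        for table, h in enumerate(min_hashes(items, hashers)):
            for rkey in buckets.get((table, h), ()):
                candidates.add((lkey, rkey))

    # Steps 3 and 4: compute the exact distance for candidates only,
    # then keep the pairs below the threshold.
    result = []
    for lkey, rkey in sorted(candidates):
        dist = jaccard_distance(left[lkey], right[rkey])
        if dist < threshold:
            result.append((lkey, rkey, dist))
    return result


left = {"a": {1, 2, 3}}
right = {"x": {1, 2, 3}, "y": {7, 8, 9}}
print(approx_similarity_join(left, right, threshold=0.5))
# prints [('a', 'x', 0.0)]
```

Identical sets always collide in every table, so the pair (a, x) is guaranteed to be found. The join in step 2 is where the number of candidate rows can blow up, which is why it tends to dominate the runtime.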

The join is probably the slow part here, since it shuffles the data. Some things to try:

  1. Change the partitioning of your input DataFrames.
  2. Change spark.sql.shuffle.partitions (the default gives you 200 partitions after a join).
  3. Your dataset looks small enough that you could use spark.sql.functions.broadcast(dataset) to get a map-side join.
  4. Check whether your vectors are sparse or dense; the algorithm works better with sparse vectors.

Of these four options, 2 and 3 have worked best for me, always using sparse vectors.
