Optimize a Spark job that has to compute all-pairs (each-to-each) entry similarity and output the top N similar items for each entry

Backend · open · 5 answers · 1320 views
Asked by 予麋鹿, 2020-12-09 06:30

I have a Spark job that needs to compute content-based movie similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a feature vector).

5 answers
  •  一向 (OP)
     2020-12-09 06:56

    Another possible solution would be to use the built-in RowMatrix and its brute-force columnSimilarities() method, as explained by Databricks:

    https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

    https://datascience.stackexchange.com/questions/14862/spark-item-similarity-recommendation
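
    A minimal sketch of that approach, assuming one concatenated SparseVector per movie keyed by a 0-based movie index (the names movieVectors and ColumnSimilarityDemo, the toy data, and the top-N post-processing are illustrative, not from the original answer). Note that columnSimilarities() compares columns, so the matrix is transposed first so that movies become columns:

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    object ColumnSimilarityDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("col-sims").setMaster("local[*]"))

        // Toy stand-in for the 46k movies: (movieId, concatenated feature vector).
        val movieVectors = sc.parallelize(Seq(
          0L -> Vectors.sparse(4, Array(0, 2), Array(1.0, 1.0)),
          1L -> Vectors.sparse(4, Array(0, 3), Array(1.0, 2.0)),
          2L -> Vectors.sparse(4, Array(1, 2), Array(3.0, 1.0))
        ))

        // columnSimilarities() compares COLUMNS, so transpose the matrix so
        // that rows = features and columns = movies.
        val matrix = new IndexedRowMatrix(
            movieVectors.map { case (id, vec) => IndexedRow(id, vec) })
          .toCoordinateMatrix()
          .transpose()
          .toRowMatrix()

        // Exact brute-force cosine similarity of every column pair
        // (upper triangle, i < j only); columnSimilarities(0.1) would use
        // the approximate DIMSUM sampling variant instead.
        val sims = matrix.columnSimilarities()

        // Keep the top-N most similar movies for each movie.
        val topN = 2
        val top = sims.entries
          .flatMap(e => Seq((e.i, (e.j, e.value)), (e.j, (e.i, e.value)))) // symmetrize
          .groupByKey()
          .mapValues(_.toSeq.sortBy(-_._2).take(topN))

        top.collect().foreach { case (id, neighbours) =>
          println(s"$id -> ${neighbours.mkString(", ")}")
        }
      }
    }
    ```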

    Notes:

    • Keep in mind that you will always have on the order of N^2 values in the resulting similarity matrix; with N = 46,000 movies that is about 2.1 billion pairs (roughly half that if you keep only the upper triangle).
    • Since each movie is represented by a set of SparseVectors, you will have to concatenate them into a single vector per movie first (see the sketch below).
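
    A minimal sketch of that concatenation step, assuming the mllib SparseVector API (the helper name concatSparse is hypothetical): each vector's indices are shifted by the combined size of the vectors before it.

    ```scala
    import org.apache.spark.mllib.linalg.SparseVector
    import scala.collection.mutable.ArrayBuffer

    // Hypothetical helper: concatenate several SparseVectors into one by
    // offsetting each vector's indices by the sizes of the vectors before it.
    def concatSparse(vecs: Seq[SparseVector]): SparseVector = {
      val totalSize = vecs.map(_.size).sum
      val indices = ArrayBuffer.empty[Int]
      val values  = ArrayBuffer.empty[Double]
      var offset = 0
      for (v <- vecs) {
        indices ++= v.indices.map(_ + offset)
        values  ++= v.values
        offset  += v.size
      }
      new SparseVector(totalSize, indices.toArray, values.toArray)
    }
    ```

    With one such concatenated vector per movie, the RowMatrix pipeline sketched above applies directly.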
