Optimize a Spark job that has to compute all-pairs (each-to-each) entry similarity and output the top N similar items for each entry

Backend · open · 5 answers · 1320 views
Asked by 予麋鹿, 2020-12-09 06:30

I have a Spark job that needs to compute content-based movie similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a feature vector).

5 answers
  •  一向 (OP)
     2020-12-09 06:56

    Another possible solution would be to use the built-in RowMatrix and its brute-force columnSimilarities() method, as explained by Databricks:

    https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

    https://datascience.stackexchange.com/questions/14862/spark-item-similarity-recommendation
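
    A minimal sketch of that approach, assuming one concatenated SparseVector per movie keyed by a 0-based movie index (the names movieVectors and ColumnSimilarityDemo, the toy data, and the top-N post-processing are illustrative, not from the original answer). Note that columnSimilarities() compares columns, so the matrix is transposed first so that movies become columns:

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    object ColumnSimilarityDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("col-sims").setMaster("local[*]"))

        // Toy stand-in for the 46k movies: (movieId, concatenated feature vector).
        val movieVectors = sc.parallelize(Seq(
          0L -> Vectors.sparse(4, Array(0, 2), Array(1.0, 1.0)),
          1L -> Vectors.sparse(4, Array(0, 3), Array(1.0, 2.0)),
          2L -> Vectors.sparse(4, Array(1, 2), Array(3.0, 1.0))
        ))

        // columnSimilarities() compares COLUMNS, so transpose the matrix so
        // that rows = features and columns = movies.
        val matrix = new IndexedRowMatrix(
            movieVectors.map { case (id, vec) => IndexedRow(id, vec) })
          .toCoordinateMatrix()
          .transpose()
          .toRowMatrix()

        // Exact brute-force cosine similarity of every column pair
        // (upper triangle, i < j only); columnSimilarities(0.1) would use
        // the approximate DIMSUM sampling variant instead.
        val sims = matrix.columnSimilarities()

        // Keep the top-N most similar movies for each movie.
        val topN = 2
        val top = sims.entries
          .flatMap(e => Seq((e.i, (e.j, e.value)), (e.j, (e.i, e.value)))) // symmetrize
          .groupByKey()
          .mapValues(_.toSeq.sortBy(-_._2).take(topN))

        top.collect().foreach { case (id, neighbours) =>
          println(s"$id -> ${neighbours.mkString(", ")}")
        }
      }
    }
    ```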

    Notes:

    • Keep in mind that you will always have on the order of N^2 values in the resulting similarity matrix; with N = 46,000 movies that is about 2.1 billion pairs (roughly half that if you keep only the upper triangle).
    • Since each movie is represented by a set of SparseVectors, you will have to concatenate them into a single vector per movie first (see the sketch below).
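
    A minimal sketch of that concatenation step, assuming the mllib SparseVector API (the helper name concatSparse is hypothetical): each vector's indices are shifted by the combined size of the vectors before it.

    ```scala
    import org.apache.spark.mllib.linalg.SparseVector
    import scala.collection.mutable.ArrayBuffer

    // Hypothetical helper: concatenate several SparseVectors into one by
    // offsetting each vector's indices by the sizes of the vectors before it.
    def concatSparse(vecs: Seq[SparseVector]): SparseVector = {
      val totalSize = vecs.map(_.size).sum
      val indices = ArrayBuffer.empty[Int]
      val values  = ArrayBuffer.empty[Double]
      var offset = 0
      for (v <- vecs) {
        indices ++= v.indices.map(_ + offset)
        values  ++= v.values
        offset  += v.size
      }
      new SparseVector(totalSize, indices.toArray, values.toArray)
    }
    ```

    With one such concatenated vector per movie, the RowMatrix pipeline sketched above applies directly.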
