Optimize a Spark job that has to compute each-to-each entry similarity and output the top N similar items for each

予麋鹿 2020-12-09 06:30

I have a Spark job that needs to compute movie content-based similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a feature …

5 answers
  •  时光取名叫无心
    2020-12-09 07:23

    Another thought: given that your matrix is relatively small and sparse, it can fit in memory as a Breeze CSCMatrix[Int].

    Then you can compute co-occurrences as A'B (A.transposed * B), followed by a top-N selection on the LLR (log-likelihood ratio) of each pair. Since you keep only the top 10 items per row, the output matrix will be very sparse as well.
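    The co-occurrence + LLR + top-N idea above can be sketched as follows. This is a minimal illustrative Python/SciPy analogue of the Breeze approach (the answer itself is about Scala); the toy interaction matrix, the `top_n` value, and all variable names are made up for the example, and the LLR function follows Dunning's formulation as popularized by Apache Mahout's `LogLikelihood`:

    ```python
    import numpy as np
    from math import log
    from scipy.sparse import csc_matrix

    def x_log_x(x):
        return 0.0 if x == 0 else x * log(x)

    def entropy(*counts):
        # Unnormalized Shannon entropy, as used in Mahout's LLR helper
        return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

    def llr(k11, k12, k21, k22):
        # Dunning's log-likelihood ratio over a 2x2 contingency table:
        # k11 = both items, k12/k21 = one item only, k22 = neither
        row = entropy(k11 + k12, k21 + k22)
        col = entropy(k11 + k21, k12 + k22)
        mat = entropy(k11, k12, k21, k22)
        return max(0.0, 2.0 * (row + col - mat))

    # Toy interaction matrix A (rows = users, columns = items)
    A = csc_matrix(np.array([
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 1, 1, 1],
        [0, 0, 1, 1],
    ]))

    cooc = (A.T @ A).toarray()                 # item-item co-occurrence counts (A'A)
    item_totals = np.asarray(A.sum(axis=0)).ravel()
    n_users = A.shape[0]

    top_n = 2
    for i in range(cooc.shape[0]):
        scores = []
        for j in range(cooc.shape[0]):
            if i == j:
                continue
            k11 = int(cooc[i, j])              # users who interacted with both
            k12 = int(item_totals[i]) - k11    # item i only
            k21 = int(item_totals[j]) - k11    # item j only
            k22 = n_users - k11 - k12 - k21    # neither item
            scores.append((j, llr(k11, k12, k21, k22)))
        scores.sort(key=lambda t: -t[1])
        print(i, scores[:top_n])               # keep only top-N per row -> sparse output
    ```

    At 46k items the dense `toarray()` above would not fly; the point of the Breeze CSCMatrix suggestion is that both A'A and the top-N result stay sparse end to end.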

    You can lookup the details here:

    https://github.com/actionml/universal-recommender
