Optimize a Spark job that has to compute each-to-each entry similarity and output the top N similar items for each

予麋鹿 2020-12-09 06:30

I have a Spark job that needs to compute movie content-based similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a feature …

5 answers
  •  时光取名叫无心
    2020-12-09 07:23

    Another thought: given that your matrix is relatively small and sparse, it can fit in memory as a Breeze CSCMatrix[Int].

    Then you can compute co-occurrences as A'B (A.transposed * B), followed by a top-N selection on the LLR (log-likelihood ratio) of each pair. Since you keep only the top 10 items per row, the output matrix will be very sparse as well.
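    The co-occurrence + LLR + top-N idea above can be sketched as follows. This is a minimal illustrative Python/SciPy analogue of the Breeze approach (the answer itself is about Scala); the toy interaction matrix, the `top_n` value, and all variable names are made up for the example, and the LLR function follows Dunning's formulation as popularized by Apache Mahout's `LogLikelihood`:

    ```python
    import numpy as np
    from math import log
    from scipy.sparse import csc_matrix

    def x_log_x(x):
        return 0.0 if x == 0 else x * log(x)

    def entropy(*counts):
        # Unnormalized Shannon entropy, as used in Mahout's LLR helper
        return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

    def llr(k11, k12, k21, k22):
        # Dunning's log-likelihood ratio over a 2x2 contingency table:
        # k11 = both items, k12/k21 = one item only, k22 = neither
        row = entropy(k11 + k12, k21 + k22)
        col = entropy(k11 + k21, k12 + k22)
        mat = entropy(k11, k12, k21, k22)
        return max(0.0, 2.0 * (row + col - mat))

    # Toy interaction matrix A (rows = users, columns = items)
    A = csc_matrix(np.array([
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 1, 1, 1],
        [0, 0, 1, 1],
    ]))

    cooc = (A.T @ A).toarray()                 # item-item co-occurrence counts (A'A)
    item_totals = np.asarray(A.sum(axis=0)).ravel()
    n_users = A.shape[0]

    top_n = 2
    for i in range(cooc.shape[0]):
        scores = []
        for j in range(cooc.shape[0]):
            if i == j:
                continue
            k11 = int(cooc[i, j])              # users who interacted with both
            k12 = int(item_totals[i]) - k11    # item i only
            k21 = int(item_totals[j]) - k11    # item j only
            k22 = n_users - k11 - k12 - k21    # neither item
            scores.append((j, llr(k11, k12, k21, k22)))
        scores.sort(key=lambda t: -t[1])
        print(i, scores[:top_n])               # keep only top-N per row -> sparse output
    ```

    At 46k items the dense `toarray()` above would not fly; the point of the Breeze CSCMatrix suggestion is that both A'A and the top-N result stay sparse end to end.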

    You can lookup the details here:

    https://github.com/actionml/universal-recommender
