Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each


予麋鹿 2020-12-09 06:30

I have a Spark job that needs to compute movie content-based similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a featur

5 answers

星月不相逢 (original poster)

2020-12-09 07:21

One important suggestion, which I have used in similar scenarios: suppose some movies are already known to be similar to each other, for example

relation     similarity score
A -> B       8/10
B -> C       7/10
C -> D       9/10

If

E -> A       4    // less than some threshold or hyperparameter

then don't calculate the similarity for
E -> B
E -> C
E -> D
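
The pruning idea above could be sketched roughly as follows, in plain Python rather than Spark. Everything here is an assumption made for illustration: the dict-based sparse vectors, the `groups` mapping from a representative movie to movies already known to be similar to it, the `pruned_similarities` helper, and the use of cosine similarity with a 0.4 threshold. Note that similarity is not transitive, so this is a heuristic that can miss some true neighbours in exchange for fewer comparisons.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(val * v.get(idx, 0.0) for idx, val in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pruned_similarities(query, groups, vectors, threshold):
    """Score `query` against each group, skipping members of dissimilar groups.

    groups maps a representative id (A) to ids already known to be
    similar to it (B, C, D). Returns (scores, number_of_comparisons).
    """
    scores = {}
    comparisons = 0
    for rep, members in groups.items():
        rep_score = cosine(vectors[query], vectors[rep])
        comparisons += 1
        scores[rep] = rep_score
        if rep_score < threshold:
            continue  # E -> A is low, so skip E -> B, E -> C, E -> D
        for member in members:
            scores[member] = cosine(vectors[query], vectors[member])
            comparisons += 1
    return scores, comparisons


# Hypothetical toy data: A, B, C, D are close to each other, E is not.
vectors = {
    "A": {0: 1.0, 1: 1.0},
    "B": {0: 1.0, 1: 0.9},
    "C": {0: 0.9, 1: 1.0},
    "D": {0: 1.0, 1: 0.8},
    "E": {0: 0.1, 2: 1.0},
    "F": {0: 1.0, 1: 1.0, 2: 0.1},
}
groups = {"A": ["B", "C", "D"]}

scores_e, n_e = pruned_similarities("E", groups, vectors, threshold=0.4)
scores_f, n_f = pruned_similarities("F", groups, vectors, threshold=0.4)
print(n_e, sorted(scores_e))
print(n_f, sorted(scores_f))
```

In this toy data E is compared only with A, while F, which is close to A, is compared with the whole group. In a Spark job the same check would presumably be applied before generating candidate pairs, so that pruned pairs are never shuffled at all.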
