Optimize a Spark job that has to calculate each-to-each entry similarity and output the top N similar items for each

予麋鹿 2020-12-09 06:30

I have a Spark job that needs to compute movie content-based similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a featur

5 answers
  •  星月不相逢
    2020-12-09 07:21

    One very important suggestion, which I have used in similar scenarios, is if some movie

    relation     similarity score
    A -> B       8/10
    B -> C       7/10
    C -> D       9/10

    If

    E -> A       4  // less than some threshold or hyperparameter
    Don't calculate similarity for
    E -> B
    E -> C
    E -> D
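
A minimal sketch of the pruning idea above, in plain Python rather than Spark. It assumes the movies have already been arranged into groups of mutually similar items (like A, B, C, D), which the answer does not describe how to build. The function name `pruned_similarities`, the toy vectors, the cosine measure on a 0 to 1 scale, and the threshold of 0.4 are all hypothetical choices for illustration. This is a heuristic: a group is skipped whenever its first member scores below the threshold, so some true neighbours can be missed.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Vector = Tuple[float, ...]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pruned_similarities(
    candidate: str,
    groups: Sequence[Sequence[str]],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> Dict[str, float]:
    """Score `candidate` against group members, skipping every group whose
    representative (its first member) scores below `threshold`."""
    scores: Dict[str, float] = {}
    for group in groups:
        if not group:
            continue
        representative = group[0]
        rep_score = similarity(candidate, representative)
        if rep_score < threshold:
            # Assumption: if E is far from A, it is also far from B, C and D.
            continue
        scores[representative] = rep_score
        for member in group[1:]:
            scores[member] = similarity(candidate, member)
    return scores


# Hypothetical toy data: A-D point in one direction, E-G in another.
vectors: Dict[str, Vector] = {
    "A": (1.0, 0.1), "B": (0.9, 0.2), "C": (0.8, 0.3), "D": (0.9, 0.1),
    "E": (0.1, 1.0), "F": (0.0, 1.0), "G": (0.2, 0.9),
}
groups = [["A", "B", "C", "D"], ["F", "G"]]

calls = 0  # counts how many similarity computations were actually made


def counted_similarity(x: str, y: str) -> float:
    global calls
    calls += 1
    return cosine(vectors[x], vectors[y])


scores = pruned_similarities("E", groups, counted_similarity, threshold=0.4)
```

With these toy vectors, E is compared against A once, the A-D group is skipped, and only F and G are scored, so three similarity computations are made instead of six. How much this saves on 46k movies depends entirely on how well the groups are formed.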
    
