I have a Spark job that needs to compute movie content-based similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a featur
One important suggestion, which I have used in similar scenarios: if some movies are already known to be similar to each other, for example
relation    similarity score
A -> B      8/10
B -> C      7/10
C -> D      9/10
If
E -> A      4    // less than some threshold (a hyperparameter)
then don't calculate the similarity for
E -> B
E -> C
E -> D
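
A minimal sketch of this pruning idea in plain Python (not Spark), in case it helps. Everything named here is an assumption for illustration: the `cosine` helper, the `pruned_similarities` function, the `groups` input (lists of movies already known to be similar, with the first entry used as the representative), the toy vectors, and the threshold of 0.5. Cosine similarity is also just a stand-in for whatever measure the job actually uses.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    dot = sum(val * v.get(idx, 0.0) for idx, val in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pruned_similarities(new_id, vectors, groups, threshold):
    """Score new_id against each group, skipping groups whose representative is dissimilar.

    groups: lists of ids assumed to be similar to each other (hypothetical input).
    The first id of each group acts as its representative.
    """
    scores = {}
    skipped = []
    for group in groups:
        rep = group[0]
        rep_score = cosine(vectors[new_id], vectors[rep])
        scores[rep] = rep_score
        if rep_score < threshold:
            # Representative is below the threshold: skip the rest of the group.
            skipped.extend(group[1:])
            continue
        for other in group[1:]:
            scores[other] = cosine(vectors[new_id], vectors[other])
    return scores, skipped


# Toy data: A, B, C, D point in nearly the same direction; E does not.
vectors = {
    "A": {0: 1.0, 1: 1.0},
    "B": {0: 1.0, 1: 0.9},
    "C": {0: 0.9, 1: 1.0},
    "D": {0: 1.0, 1: 0.8},
    "E": {0: 0.1, 2: 1.0},
}
scores, skipped = pruned_similarities("E", vectors, [["A", "B", "C", "D"]], threshold=0.5)
print(sorted(scores), skipped)
```

With this toy data only E -> A is computed and B, C, D are skipped. Keep in mind that similarity is not strictly transitive, so this is a heuristic: it trades a few possibly missed pairs for far fewer comparisons, and the threshold controls that trade-off.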