I have a Spark job that needs to compute movie content-based similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a featur
Another possible solution would be to use builtin RowMatrix and brute force columnSimilarity as explained on databricks:
https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
https://datascience.stackexchange.com/questions/14862/spark-item-similarity-recommendation
Notes: