I have a Spark job that needs to compute movie content-based similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a feature vector for one of the movie's fields). For each movie, I need to find its top 10 most similar movies.
Another possible solution would be to use the built-in RowMatrix and its brute-force columnSimilarities method, as explained on the Databricks blog (and discussed on Data Science Stack Exchange):
https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
https://datascience.stackexchange.com/questions/14862/spark-item-similarity-recommendation
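A minimal sketch of that route, assuming the matrix is laid out so that movies are the columns (columnSimilarities works column-wise, and passing a threshold enables the approximate DIMSUM sampling described in the blog post; all sizes here are illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Rows are feature dimensions, columns are movies (three movies here).
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 1.0),  // feature 0 across the three movies
  Vectors.dense(0.0, 1.0, 1.0)   // feature 1 across the three movies
))
val mat = new RowMatrix(rows)

// With a threshold > 0, DIMSUM sampling kicks in: approximate but much cheaper.
val sims = mat.columnSimilarities(0.1)
sims.entries.take(10).foreach(println)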
Notes:
It can be done more efficiently, as long as you are fine with approximations and don't require exact results (or an exact number of results).
Similarly to my answer to Efficient string matching in Apache Spark, you can use LSH, with:
BucketedRandomProjectionLSH to approximate Euclidean distance.
MinHashLSH to approximate Jaccard distance.
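For instance, a minimal MinHashLSH sketch, reusing the df of binary sparse vectors built in the hand-optimized example below (the 5 hash tables and the 0.6 distance cutoff are illustrative, not tuned values):

import org.apache.spark.ml.feature.MinHashLSH

// MinHash treats non-zero entries as set membership, which suits
// binary sparse feature vectors.
val mh = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = mh.fit(df)

// Self-join: keep pairs whose approximate Jaccard distance is below 0.6.
model.approxSimilarityJoin(df, df, 0.6, "JaccardDistance")
  .select($"datasetA.id", $"datasetB.id", $"JaccardDistance")
  .show()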
If the feature space is small (or can be reasonably reduced) and each category is relatively small, you can also optimize your code by hand: explode the feature array to generate #features records from a single record. A minimal example would be (consider it to be pseudocode):
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions._
import spark.implicits._

// This is oversimplified. In practice, don't assume a sparse-only scenario.
val indices = udf((v: SparseVector) => v.indices)

val df = Seq(
  (1L, Vectors.sparse(1024, Array(1, 3, 5), Array(1.0, 1.0, 1.0))),
  (2L, Vectors.sparse(1024, Array(3, 8, 12), Array(1.0, 1.0, 1.0))),
  (3L, Vectors.sparse(1024, Array(3, 5), Array(1.0, 1.0))),
  (4L, Vectors.sparse(1024, Array(11, 21), Array(1.0, 1.0))),
  (5L, Vectors.sparse(1024, Array(21, 32), Array(1.0, 1.0)))
).toDF("id", "features")

// One record per (movie, feature index); the self-join on the shared index
// only ever pairs rows that have at least one feature in common.
val possibleMatches = df
  .withColumn("key", explode(indices($"features")))
  .transform(df => df.alias("left").join(df.alias("right"), Seq("key")))

// intersectionCosine (sketched below) compares vectors on their overlap.
def closeEnough(threshold: Double) = udf(
  (v1: SparseVector, v2: SparseVector) => intersectionCosine(v1, v2) > threshold)

possibleMatches
  .filter(closeEnough(0.5)($"left.features", $"right.features"))  // example threshold
  .select($"left.id", $"right.id")
  .distinct
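The snippet leaves intersectionCosine undefined; one plausible reading is standard cosine similarity, since for sparse vectors the dot product only involves the shared indices anyway. A minimal sketch of such a hypothetical helper:

// Hypothetical helper: cosine similarity between two sparse vectors,
// with the dot product accumulated over the intersecting indices.
def intersectionCosine(v1: SparseVector, v2: SparseVector): Double = {
  val m1 = v1.indices.zip(v1.values).toMap
  val dot = v2.indices.zip(v2.values)
    .collect { case (i, x) if m1.contains(i) => m1(i) * x }
    .sum
  val norm1 = math.sqrt(v1.values.map(x => x * x).sum)
  val norm2 = math.sqrt(v2.values.map(x => x * x).sum)
  if (norm1 == 0 || norm2 == 0) 0.0 else dot / (norm1 * norm2)
}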
Note that both solutions are worth the overhead only if hashing / features are selective enough (and optimally sparse). In the example shown above you'd compare only rows inside sets {1, 2, 3} and {4, 5}, never between sets.
However, in the worst-case scenario (M records, N features) you can end up making N·M² comparisons, instead of M².
You can borrow from the idea of locality-sensitive hashing: hash each movie into one or more buckets so that similar movies are likely to share a bucket, then compute exact similarities only within each bucket.
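A minimal sketch of that idea using random-hyperplane (SimHash-style) signatures; the dimension, number of planes, and seed are all illustrative:

import scala.util.Random

val dim = 1024        // feature dimension (matches the vectors above)
val numPlanes = 16    // signature length; more planes means finer buckets
val rng = new Random(42)
val planes = Array.fill(numPlanes, dim)(rng.nextGaussian())

// Bit i of the signature is the sign of the dot product with hyperplane i.
// Movies with identical signatures fall into the same bucket.
def signature(v: Array[Double]): Int =
  planes.zipWithIndex.foldLeft(0) { case (sig, (p, i)) =>
    val dot = p.zip(v).map { case (a, b) => a * b }.sum
    if (dot > 0) sig | (1 << i) else sig
  }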
I have implemented something similar using the above approach.
Hope this helps.
One very important suggestion that I have used in similar scenarios: if some movies are already known to be similar, use that to prune comparisons. Say the similarity scores are:

relation   similarity score
A -> B     8/10
B -> C     7/10
C -> D     9/10

If

E -> A     4/10  // less than some threshold or hyperparameter

don't calculate the similarity for

E -> B
E -> C
E -> D
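A sketch of that pruning logic, with a hypothetical sim function and a precomputed neighbor map (the 0.4 threshold is illustrative):

// Hypothetical inputs: a similarity function and, for each movie, the set
// of movies already known to be strongly similar to it.
val threshold = 0.4
val strongNeighbors = Map(
  "A" -> Set("B", "C", "D"),   // A is strongly similar to B, C, D
  "B" -> Set("C", "D"),
  "C" -> Set("D"))

def candidates(query: String, movies: Seq[String],
               sim: (String, String) => Double): Seq[String] = {
  var skip = Set.empty[String]
  movies.filter { m =>
    if (skip(m)) false
    else if (sim(query, m) < threshold) {
      // query is far from m, so skip everything known to be close to m
      skip ++= strongNeighbors.getOrElse(m, Set.empty)
      false
    } else true
  }
}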
Another thought: given that your matrix is relatively small and sparse, it can fit in memory using a breeze CSCMatrix[Int].
Then, you can compute co-occurrences using A'B (A.transposed * B), followed by a TopN selection of the LLR (log-likelihood ratio) of each pair. Here, since you keep only the top 10 items per row, the output matrix will be very sparse as well.
You can look up the details here:
https://github.com/actionml/universal-recommender
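A minimal sketch of that in-memory route, assuming breeze's sparse transpose and product are available for CSCMatrix (dimensions and entries are illustrative; the LLR scoring and TopN selection are only noted in a comment):

import breeze.linalg.CSCMatrix

// A is (features x movies), built sparsely; illustrative sizes.
val builder = new CSCMatrix.Builder[Int](rows = 1024, cols = 46000)
builder.add(3, 0, 1)   // movie 0 has feature 3
builder.add(3, 1, 1)   // movie 1 shares feature 3 with movie 0
builder.add(8, 1, 1)
val a = builder.result

// A' * A yields movie-by-movie co-occurrence counts (here B = A).
val cooc = a.t * a

// Next step (not shown): score each non-zero pair with LLR and keep the
// top 10 per row, which keeps the result matrix very sparse.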