How to update Spark MatrixFactorizationModel for ALS

后端 未结 2 672
甜味超标
甜味超标 2020-12-08 03:04

I build a simple recommendation system for the MovieLens DB inspired by https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html.

I also hav

相关标签:
2条回答
  • 2020-12-08 03:55

    It seems like you want to be doing some kind of online learning. That's the notion that you're actually updating the model while receiving data. Spark MLLib has limited streaming machine learning options. There's a streaming linear regression and a streaming K-Means.

    Many machine learning problems work just fine with batch solutions, perhaps retraining the model every few hours or days. There are probably strategies for solving this.

    One option could be an ensemble model where you combine the results of your ALS with another model that helps make predictions about unseen movies.

    If you expect to see a lot of previously unseen movies though, collaborative filtering probably doesn't do what you want. If those new movies aren't in the model at all, there's no way for the model to know what other people who watched those liked.

    A better option might be to take a different strategy and try some kind of latent semantic analysis on the movies and model concepts of what a movie is (like genre, themes, etc...), that way new movies with various properties and fit into an existing model, and ratings affect how strongly those properties interact together.

    0 讨论(0)
  • 2020-12-08 03:57

    Edit: the following worked for me because I had implicit feedback ratings and was only interesting in ranking the products for a new user. More details here


    You can actually get predictions for new users using the trained model (without updating it):

    To get predictions for a user in the model, you use its latent representation (vector u of size f (number of factors)), which is multiplied by the product latent factor matrix (matrix made of the latent representations of all products, a bunch of vectors of size f) and gives you a score for each product. For new users, the problem is that you don't have access to their latent representation (you only have the full representation of size M (number of different products), but what you can do is use a similarity function to compute a similar latent representation for this new user by multiplying it by the transpose of the product matrix.

    i.e. if you user latent matrix is u and your product latent matrix is v, for user i in the model, you get scores by doing: u_i * v for a new user, you don't have a latent representation, so take the full representation full_u and do: full_u * v^t * v This will approximate the latent factors for the new users and should give reasonable recommendations (if the model already gives reasonable recommendations for existing users)

    To answer the question of training, this allows you to compute predictions for new users without having to do the heavy computation of the model which you can now do only once in a while. So you have you batch processing at night and can still make prediction for new user during the day.

    Note: MLLIB gives you access to the matrix u and v

    0 讨论(0)
提交回复
热议问题