Why does a spark-ml ALS model return NaN and negative predictions?

Question

I'm trying to use ALS from spark-ml with implicit ratings.

I noticed that some of the predictions given by my trained model are negative or NaN. Why is that?

Answer 1:

Apache Spark provides an option to enforce non-negativity constraints on ALS.

To remove the negative values, you just need to set:

Python:

nonnegative=True

Scala:

setNonnegative(true)

when creating your ALS model, e.g.:

>>> from pyspark.ml.recommendation import ALS
>>> als = ALS(rank=10, maxIter=5, seed=0, nonnegative=True)

Non-negative matrix factorization (NMF or NNMF), also called non-negative matrix approximation, is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements [Ref. Wikipedia].

If you want to read more about NMF, I'd recommend the following paper:

Collaborative Filtering via Ensembles of Matrix Factorizations

As for the NaN values, they usually come from splitting your dataset: a user or item that appears only in the test set, and not in the training set, is unseen by the model, which has no learned factors for it and therefore predicts NaN. The same thing can happen when you cross-validate your training. A couple of related JIRAs are marked as resolved for Spark 2.2:

https://issues.apache.org/jira/browse/SPARK-14489
https://issues.apache.org/jira/browse/SPARK-19345

With these changes (Spark 2.2 and later), you can set the cold start strategy to use when creating your model, through the coldStartStrategy parameter.

Source: https://stackoverflow.com/questions/44911349/why-does-spark-ml-als-model-returns-nan-and-negative-numbers-predictions

Tags

apache-spark

pyspark

apache-spark-mllib