Tuning parameters for implicit pyspark.ml ALS matrix factorization model through pyspark.ml CrossValidator

臣服心动 2020-11-30 01:37

I'm trying to tune the parameters of an ALS matrix factorization model that uses implicit data. For this, I'm trying to use pyspark.ml.tuning.CrossValidator to run through

3 Answers
  •  广开言路
    2020-11-30 02:07

    Technical issues aside, strictly speaking neither evaluator is appropriate for the output that ALS produces with implicit feedback.

    • you cannot use RegressionEvaluator because, as you already know, the prediction can be interpreted as a confidence value and is represented as a floating point number roughly in the range [0, 1], while the label column is just an unbounded integer. These values are clearly not comparable.
    • you cannot use BinaryClassificationEvaluator because, even if the prediction can be interpreted as a probability, the label doesn't represent a binary decision. Moreover, the prediction column has an invalid type and can't be used directly with BinaryClassificationEvaluator.
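
    To make the RegressionEvaluator mismatch concrete, here is a minimal plain-Python sketch (not Spark code). The `rmse` helper, the counts, and both score lists are made up for illustration; the point is only that when labels are raw counts, RMSE is dominated by the label scale rather than by how well the scores order the rows.

```python
import math


def rmse(predictions, labels):
    """Root-mean-square error between two equal-length sequences."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: raw interaction counts (unbounded integers)
# and model scores that stay inside [0, 1].
counts = [1, 3, 10, 50]
good_scores = [0.6, 0.8, 0.9, 1.0]  # same ordering as the counts
bad_scores = [1.0, 0.9, 0.8, 0.6]   # exactly the opposite ordering

rmse_good = rmse(good_scores, counts)
rmse_bad = rmse(bad_scores, counts)

print(rmse_good, rmse_bad)
```

    With these made-up numbers both errors land near 25 and differ by less than 1%, even though one score list orders the rows exactly backwards.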

    You can try to convert one of the columns so that the input fits the requirements, but this is not really justified from a theoretical perspective and introduces additional parameters that are hard to tune.

    • map the label column to the [0, 1] range and use RMSE.

    • convert the label column to a binary indicator with a fixed threshold and extend ALS / ALSModel to return the expected column type. Assuming a threshold value of 1, it could look something like this:

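      from pyspark.ml.evaluation import BinaryClassificationEvaluator
      # Spark 2.0+: pyspark.ml evaluators expect pyspark.ml.linalg vectors
      # (on Spark 1.x use pyspark.mllib.linalg instead)
      from pyspark.ml.linalg import DenseVector, VectorUDT
      from pyspark.ml.recommendation import ALS, ALSModel
      from pyspark.sql.functions import udf, col

      class BinaryALS(ALS):
          # Accept the optional param map so the estimator can be called
          # the way CrossValidator calls it: fit(dataset, paramMap)
          def fit(self, df, params=None):
              assert self.getImplicitPrefs()
              model = super(BinaryALS, self).fit(df, params)
              # Re-wrap the fitted Java model in the subclass below
              return ALSBinaryModel(model._java_obj)

      class ALSBinaryModel(ALSModel):
          def transform(self, df):
              transformed = super(ALSBinaryModel, self).transform(df)
              # Turn the scalar prediction x into a [1 - x, x] vector,
              # the shape BinaryClassificationEvaluator reads from rawPrediction
              as_vector = udf(lambda x: DenseVector([1 - x, x]), VectorUDT())
              return transformed.withColumn(
                  "rawPrediction", as_vector(col("prediction")))

      # Add binary label column (dfCounts is the DataFrame from the question)
      with_binary = dfCounts.withColumn(
          "label_binary", (col("rating") > 0).cast("double"))

      als_binary_model = BinaryALS(implicitPrefs=True).fit(with_binary)

      evaluatorB = BinaryClassificationEvaluator(
          metricName="areaUnderROC", labelCol="label_binary")

      evaluatorB.evaluate(als_binary_model.transform(with_binary))
      ## 1.0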
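
    As a rough illustration of what the thresholding approach measures, the sketch below binarizes counts and computes ROC AUC from its pairwise definition in plain Python. The helper names `binarize` and `area_under_roc` and the data are hypothetical, and Spark's BinaryClassificationEvaluator may compute the metric differently internally; this only shows the quantity being estimated.

```python
def binarize(counts, threshold=1):
    """Map each count to 1.0 if it reaches the threshold, else 0.0."""
    return [1.0 if c >= threshold else 0.0 for c in counts]


def area_under_roc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs in which
    the positive row gets the higher score; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1.0]
    negatives = [s for s, y in zip(scores, labels) if y == 0.0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: interaction counts and model scores per row.
counts = [0, 0, 1, 4, 0, 7]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

labels = binarize(counts, threshold=1)
auc = area_under_roc(scores, labels)
print(auc)
```

    The metric is only defined when both classes are present, so the evaluated data has to contain rows below the threshold as well as rows at or above it.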
      
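    For the first option (mapping the label to [0, 1] and using RMSE), one possible mapping is to clip counts at a cap and divide by it. This is an assumption for illustration, not something the answer prescribes; `scale_to_unit`, `cap`, and the data are hypothetical. The sketch also shows how the choice of cap, one of the extra parameters mentioned above, changes the resulting error.

```python
import math


def scale_to_unit(counts, cap):
    """Clip each count at `cap` and divide by it, giving values in [0, 1]."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [min(c, cap) / cap for c in counts]


def rmse(predictions, labels):
    """Root-mean-square error between two equal-length sequences."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: raw counts and model scores for the same rows.
counts = [1, 3, 10, 50]
scores = [0.1, 0.3, 0.9, 1.0]

# Same scores, two different caps for the label mapping.
rmse_cap_10 = rmse(scores, scale_to_unit(counts, cap=10))
rmse_cap_50 = rmse(scores, scale_to_unit(counts, cap=50))

print(rmse_cap_10, rmse_cap_50)
```

    Here the same scores give an RMSE of about 0.05 with a cap of 10 and a noticeably larger one with a cap of 50, so the metric depends as much on the mapping as on the model.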

    Generally speaking, textbooks say little about evaluating recommender systems with implicit feedback. I suggest reading eliasah's answer about evaluating this kind of recommender.
