What do columns ‘rawPrediction’ and ‘probability’ of DataFrame mean in Spark MLlib?

余生分开走 2021-02-05 10:09

After training a LogisticRegressionModel, I transformed the test DataFrame with it to get the prediction DataFrame. When I call prediction.show(), the output column names are:

3 Answers
  •  半阙折子戏
    2021-02-05 10:53

    In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation:

    The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).

    It is not there in the later versions, but you can still find it in the Scala source code.

    Unfortunate wording aside, the rawPrediction column in Spark ML, for the logistic regression case, is what the rest of the world calls logits, i.e. the raw output of a logistic regression classifier, which is subsequently transformed into a probability score by the logistic function exp(x)/(1+exp(x)).
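
    As a minimal sketch of that transformation in plain Python (the helper name `logistic` is hypothetical and not part of Spark), the function can be written in a form that is algebraically the same as exp(x)/(1+exp(x)) but never exponentiates a large positive number:

```python
import math

def logistic(x: float) -> float:
    """Logistic (sigmoid) function, exp(x) / (1 + exp(x)).

    `logistic` is a hypothetical helper name, not a Spark API.
    Both branches are algebraically identical; each one only ever
    exponentiates a non-positive number, which avoids overflow.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# A logit of 0 means no preference between the two classes.
print(logistic(0.0))  # 0.5

# Large positive logits approach 1, large negative logits approach 0.
print(logistic(1000.0), logistic(-1000.0))
```

    Because logistic(-x) = 1 - logistic(x), a pair of logits of the form [x, -x] maps to two probabilities that sum to 1.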

    Here is an example with toy data in PySpark:

    spark.version
    # u'2.2.0'
    
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import Row
    df = sqlContext.createDataFrame([
         (0.0, Vectors.dense(0.0, 1.0)),
         (1.0, Vectors.dense(1.0, 0.0))], 
         ["label", "features"])
    df.show()
    # +-----+---------+
    # |label| features|
    # +-----+---------+
    # |  0.0|[0.0,1.0]|
    # |  1.0|[1.0,0.0]|
    # +-----+---------+
    
    lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
    lr_model = lr.fit(df)
    
    test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                           Row(features=Vectors.dense(0.5, 0.2))]).toDF()
    lr_result = lr_model.transform(test)
    lr_result.show(truncate=False)
    

    Here is the result:

    +---------+----------------------------------------+----------------------------------------+----------+
    |features |rawPrediction                           |probability                             |prediction|
    +---------+----------------------------------------+----------------------------------------+----------+
    |[0.2,0.5]|[0.9894187891647654,-0.9894187891647654]|[0.7289731070426124,0.27102689295738763]|0.0       |
    |[0.5,0.2]|[-0.9894187891647683,0.9894187891647683]|[0.2710268929573871,0.728973107042613]  |1.0       |
    +---------+----------------------------------------+----------------------------------------+----------+
    

    Let's now confirm that the logistic function of rawPrediction gives the probability column:

    import numpy as np
    
    x1 = np.array([0.9894187891647654,-0.9894187891647654])
    np.exp(x1)/(1+np.exp(x1))
    # array([ 0.72897311, 0.27102689])
    
    x2 = np.array([-0.9894187891647683,0.9894187891647683])
    np.exp(x2)/(1+np.exp(x2))
    # array([ 0.27102689, 0.72897311])
    

    i.e. the computed values do match the probability column.


    So, to summarize regarding all three (3) output columns:

    • rawPrediction is the raw output (the logits) of the logistic regression classifier (array with length equal to the number of classes)
    • probability is the result of applying the logistic function to rawPrediction (array of the same length as rawPrediction)
    • prediction is the index at which the probability array takes its maximum value (the argmax), i.e. the most probable label (single number)
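
    The summary above can be sketched in plain Python. The helper name `columns_from_raw` is hypothetical, and the sketch assumes the binary case shown in this answer, where the logistic function is applied to each entry of rawPrediction; it is not meant to describe what Spark does for more than two classes.

```python
import math

def columns_from_raw(raw_prediction):
    """Hypothetical helper (not a Spark API): rebuild `probability` and
    `prediction` from a binary `rawPrediction` vector."""
    # Apply the logistic function to each raw score.
    probability = [1.0 / (1.0 + math.exp(-x)) for x in raw_prediction]
    # prediction is the index of the largest probability, as a float label.
    prediction = float(max(range(len(probability)), key=probability.__getitem__))
    return probability, prediction

# rawPrediction values copied from the two result rows shown earlier.
rows = [
    [0.9894187891647654, -0.9894187891647654],
    [-0.9894187891647683, 0.9894187891647683],
]
for raw in rows:
    probability, prediction = columns_from_raw(raw)
    print(raw, probability, prediction)
```

    Run against the two rows above, this should give predictions 0.0 and 1.0 respectively, with probabilities agreeing with the probability column up to floating-point rounding.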
