What do columns ‘rawPrediction’ and ‘probability’ of DataFrame mean in Spark MLlib？

前端未结

关注

 3  651

余生分开走 2021-02-05 10:09

After I trained a LogisticRegressionModel, I transformed the test data DF with it and get the prediction DF. And then when I call prediction.show(), the output column names are:

3条回答

半阙折子戏 (楼主)

2021-02-05 10:53

In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation:

The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).

It is not there in the later versions, but you can still find it in the Scala source code.

Anyway, and any unfortunate wording aside, the rawPrecictions in Spark ML, for the logistic regression case, is what the rest of the world call logits, i.e. the raw output of a logistic regression classifier, which is subsequently transformed into a probability score using the logistic function exp(x)/(1+exp(x)).

Here is an example with toy data in Pyspark:

spark.version
# u'2.2.0'

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
df = sqlContext.createDataFrame([
     (0.0, Vectors.dense(0.0, 1.0)),
     (1.0, Vectors.dense(1.0, 0.0))], 
     ["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# |  0.0|[0.0,1.0]|
# |  1.0|[1.0,0.0]|
# +-----+---------+

lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)

test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
lr_result.show(truncate=False)

Here is the result:

+---------+----------------------------------------+----------------------------------------+----------+ 
|features |                          rawPrediction |                            probability |prediction|
+---------+----------------------------------------+----------------------------------------+----------+ 
|[0.2,0.5]|[0.9894187891647654,-0.9894187891647654]|[0.7289731070426124,0.27102689295738763]|      0.0 |
|[0.5,0.2]|[-0.9894187891647683,0.9894187891647683]|[0.2710268929573871,0.728973107042613]  |      1.0 | 
+---------+----------------------------------------+----------------------------------------+----------+

Let's now confirm that the logistic function of rawPrediction gives the probability column:

import numpy as np

x1 = np.array([0.9894187891647654,-0.9894187891647654])
np.exp(x1)/(1+np.exp(x1))
# array([ 0.72897311, 0.27102689])

x2 = np.array([-0.9894187891647683,0.9894187891647683])
np.exp(x2)/(1+np.exp(x2))
# array([ 0.27102689, 0.72897311])

i.e. this is the case indeed

So, to summarize regarding all three (3) output columns:

rawPrediction is the raw output of the logistic regression classifier (array with length equal to the number of classes)
probability is the result of applying the logistic function to rawPrediction (array of length equal to that of rawPrediction)
prediction is the argument where the array probability takes its maximum value, and it gives the most probable label (single number)

0 讨论(0)

查看其它3个回答