Question
I was trying to understand the output generated by a logistic regression model in PySpark.
Could anyone please explain how the rawPrediction field produced by a logistic regression model is calculated?
Thanks.
Answer 1:
In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation:
The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
It is not there in the later versions, but you can still find it in the Scala source code.
Anyway, unfortunate wording aside, rawPrediction in Spark ML is, for the logistic regression case, what the rest of the world calls logits, i.e. the raw output of the logistic regression classifier, which is subsequently transformed into a probability score using the logistic function exp(x)/(1+exp(x)).
Here is an example with toy data:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
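# (in a PySpark shell, spark, sc, and sqlContext are all predefined)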
df = sqlContext.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0))],
    ["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# | 0.0|[0.0,1.0]|
# | 1.0|[1.0,0.0]|
# +-----+---------+
lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
lr_result.show(truncate=False)
Here is the result:
+---------+----------------------------------------+----------------------------------------+----------+
|features | rawPrediction | probability |prediction|
+---------+----------------------------------------+----------------------------------------+----------+
|[0.2,0.5]|[0.9894187891647654,-0.9894187891647654]|[0.7289731070426124,0.27102689295738763]| 0.0 |
|[0.5,0.2]|[-0.9894187891647683,0.9894187891647683]|[0.2710268929573871,0.728973107042613] | 1.0 |
+---------+----------------------------------------+----------------------------------------+----------+
Let's now confirm that the logistic function of rawPrediction gives the probability column:
import numpy as np
x1 = np.array([0.9894187891647654,-0.9894187891647654])
np.exp(x1)/(1+np.exp(x1))
# array([ 0.72897311, 0.27102689])
x2 = np.array([-0.9894187891647683,0.9894187891647683])
np.exp(x2)/(1+np.exp(x2))
# array([ 0.27102689, 0.72897311])
i.e. this is indeed the case.
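We can also check where rawPrediction itself comes from: for binary logistic regression, Spark computes the margin coefficients · features + intercept and stores rawPrediction as [-margin, margin]. A quick sketch reusing the fitted lr_model from above (exact digits depend on the fit):
margin = lr_model.coefficients.dot([0.2, 0.5]) + lr_model.intercept
[-margin, margin]
# ~ [0.9894187891647654, -0.9894187891647654], matching the rawPrediction of the first test row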
So, to summarize regarding all three (3) output columns:
- rawPrediction is the raw output of the logistic regression classifier (an array with length equal to the number of classes)
- probability is the result of applying the logistic function to rawPrediction (an array of the same length as rawPrediction)
- prediction is the index at which the array probability takes its maximum value, and it gives the most probable label (a single number); see the quick check below
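As a quick check that prediction is just the argmax of probability, take the first test row from the output above:
np.argmax([0.7289731070426124, 0.27102689295738763])
# 0, i.e. prediction = 0.0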
Source: https://stackoverflow.com/questions/48256860/pyspark-2-2-0-concept-behind-raw-predictions-field-of-logistic-regression-model