问题
I am using Spark 1.5.1 and, In pyspark, after I fit the model using:
model = LogisticRegressionWithLBFGS.train(parsedData)
I can print the prediction using:
model.predict(p.features)
Is there a function to print the probability score also along with the prediction?
回答1:
You have to clear the threshold first, and this works only for binary classification:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint
parsed_data = [LabeledPoint(0.0, [4.6,3.6,1.0,0.2]),
LabeledPoint(0.0, [5.7,4.4,1.5,0.4]),
LabeledPoint(1.0, [6.7,3.1,4.4,1.4]),
LabeledPoint(0.0, [4.8,3.4,1.6,0.2]),
LabeledPoint(1.0, [4.4,3.2,1.3,0.2])]
model = LogisticRegressionWithLBFGS.train(sc.parallelize(parsed_data))
model.threshold
# 0.5
model.predict(parsed_data[2].features)
# 1
model.clearThreshold()
model.predict(parsed_data[2].features)
# 0.9873840020002339
回答2:
I presume the question is on computing probability score for the predicting the entire training set. if so , I did the following to compute it. Not sure if the post is still active, but this is howI did this:
#get the original training data before it was converted to rows of LabelPoint.
#let us assume it is otd ( of type spark DataFrame)
#let us extract the featureset as rdd by:
fs=otd.rdd.map(lambda x:x[1:]) # assuming label is col 0.
#the below is just a sample way of creating a Labelpoint rows..
parsedData= otd.rdd.map(lambda x: reg.LabeledPoint(int(x[0]-1),x[1:]))
# now convert otd to a panda DataFrame as:
ptd= otd.toPandas()
m= ptd.shape[0]
# train and get the model
model=LogisticRegressionWithLBFGS.train(trainingData,numClasses=10)
#Now store the model.predict rdd structures
predict=model.predict(fs)
pr=predict.collect()
correct=0
correct = ((ptd.label-1) == (pr)).sum()
print((correct/m) *100)
Note the above is for multi-class classification.
来源:https://stackoverflow.com/questions/33560820/how-to-print-the-probability-of-prediction-in-logisticregressionwithlbfgs-for-py