Spark ML Pipeline Logistic Regression Produces Much Worse Predictions Than R GLM

I used ML PipeLine to run logistic regression models but for some reasons I got worst results than R. I have done some researches and the only post that I found that is related to this issue is this . It seems that Spark Logistic Regression returns models that minimize loss function while R glm function uses maximum likelihood. The Spark model only got 71.3% of the records right while R can predict 95.55% of the cases correctly. I was wondering if I did something wrong on the set up and if there's a way to improve the prediction. The below is my Spark code and R code-

Spark code

partial model_input  
label,AGE,GENDER,Q1,Q2,Q3,Q4,Q5,DET_AGE_SQ  
1.0,39,0,0,1,0,0,1,31.55709342560551  
1.0,54,0,0,0,0,0,0,83.38062283737028  
0.0,51,0,1,1,1,0,0,35.61591695501733



def trainModel(df: DataFrame): PipelineModel = {  
  val lr  = new LogisticRegression().setMaxIter(100000).setTol(0.0000000000000001)  
  val pipeline = new Pipeline().setStages(Array(lr))  
  pipeline.fit(df)  
}

val meta =  NominalAttribute.defaultAttr.withName("label").withValues(Array("a", "b")).toMetadata

val assembler = new VectorAssembler().
  setInputCols(Array("AGE","GENDER","DET_AGE_SQ",
 "QA1","QA2","QA3","QA4","QA5")).
  setOutputCol("features")

val model = trainModel(model_input)
val pred= model.transform(model_input)  
pred.filter("label!=prediction").count

R code

lr <- model_input %>% glm(data=., formula=label~ AGE+GENDER+Q1+Q2+Q3+Q4+Q5+DET_AGE_SQ,
          family=binomial)
pred <- data.frame(y=model_input$label,p=fitted(lr))
table(pred $y, pred $p>0.5)

Feel free to let me know if you need any other information. Thank you!

Edit 9/18/2015 I have tried increasing the maximum iteration and decreasing the tolerance dramatically. Unfortunately, it didn't improve the prediction. It seems the model converged to a local minimum instead of the global minimum.

It seems that Spark Logistic Regression returns models that minimize loss function while R glm function uses maximum likelihood.

Minimization of a loss function is pretty much a definition of the linear models and both glm and ml.classification.LogisticRegression are no different here. Fundamental difference between these two is the way how it is achieved.

All linear models from ML/MLlib are based on some variants of Gradient descent. Quality of the model generated using this approach vary on a case by case basis and depend on the Gradient Descent and regularization parameters.

R from the other hand computes an exact solution which, given its time complexity, is not well suited for large datasets.

As I've mentioned above quality of the model generated using GS depends on the input parameters so typical way to improve it is to perform hyperparameter optimization. Unfortunately ML version is rather limited here compared to MLlib but for starters you can increase a number of iterations.

来源：https://stackoverflow.com/questions/32642789/spark-ml-pipeline-logistic-regression-produces-much-worse-predictions-than-r-glm

标签

scala

apache-spark

apache-spark-ml