Spark ML Pipeline Logistic Regression Produces Much Worse Predictions Than R GLM

╄→гoц情女王★ submitted on 2019-12-03 22:13:57

It seems that Spark Logistic Regression returns models that minimize a loss function, while the R glm function uses maximum likelihood.

Minimizing a loss function is pretty much the definition of a linear model, and glm and ml.classification.LogisticRegression are no different in that respect: maximum likelihood is itself the minimization of the negative log-likelihood. The fundamental difference between the two is how that minimum is reached.
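To see why maximum likelihood and loss minimization coincide for logistic regression, here is the standard identity. The notation (labels y_i in {0, 1}, feature vectors x_i, weights w, sigmoid σ) is assumed for illustration and does not come from the original post.

```latex
\[
\hat{w}_{\mathrm{ML}}
= \arg\max_{w} \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i}
= \arg\min_{w} \; -\sum_{i=1}^{n} \bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \bigr],
\qquad
p_i = \sigma(w^\top x_i) = \frac{1}{1 + e^{-w^\top x_i}}
\]
```

The right-hand side is the log loss, so without regularization both tools target the same optimum and differ only in how close the optimizer gets to it.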

All linear models in ML/MLlib are based on some variant of gradient descent. The quality of a model produced this way varies from case to case and depends on the gradient descent and regularization parameters.

R, on the other hand, computes an effectively exact solution (glm fits by iteratively reweighted least squares and runs it to convergence), which, given its time complexity, is not well suited to large datasets.
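The contrast can be reproduced on a toy problem. The sketch below is plain Python, not Spark's or R's actual code: the dataset, the step size, and the function names are all made up for illustration. Newton's method stands in for glm's solver, since for logistic regression iteratively reweighted least squares is the same iteration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, b, xs, ys):
    """Mean negative log-likelihood of a one-feature logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def gradient(w, b, xs, ys):
    """Gradient of the mean log loss with respect to (w, b)."""
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    return gw / len(xs), gb / len(xs)


def fit_gd(xs, ys, steps, lr):
    """Plain batch gradient descent starting from (0, 0)."""
    w = b = 0.0
    for _ in range(steps):
        gw, gb = gradient(w, b, xs, ys)
        w -= lr * gw
        b -= lr * gb
    return w, b


def fit_newton(xs, ys, steps=25):
    """Newton's method: each step solves a 2x2 system built from the Hessian."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        gw, gb = gradient(w, b, xs, ys)
        hww = hwb = hbb = 0.0
        for x in xs:
            p = sigmoid(w * x + b)
            s = p * (1 - p)
            hww += s * x * x / n
            hwb += s * x / n
            hbb += s / n
        det = hww * hbb - hwb * hwb
        # Apply the inverse of [[hww, hwb], [hwb, hbb]] to the gradient.
        w -= (hbb * gw - hwb * gb) / det
        b -= (hww * gb - hwb * gw) / det
    return w, b


# Hypothetical, non-separable toy data (one feature plus an intercept).
XS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
YS = [0, 0, 1, 0, 1, 0, 1, 1]

w_few, b_few = fit_gd(XS, YS, steps=10, lr=0.1)
w_many, b_many = fit_gd(XS, YS, steps=20000, lr=0.1)
w_newton, b_newton = fit_newton(XS, YS)

loss_few = log_loss(w_few, b_few, XS, YS)
loss_many = log_loss(w_many, b_many, XS, YS)
loss_newton = log_loss(w_newton, b_newton, XS, YS)

print("GD, 10 steps:    ", loss_few)
print("GD, 20000 steps: ", loss_many)
print("Newton, 25 steps:", loss_newton)
```

With only a handful of steps, gradient descent stops well short of the optimum, while a long run ends up at the same loss as the second-order solver. The price of the second-order solver is building and solving a Hessian system on every step, which is what becomes expensive as the number of features grows.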

As mentioned above, the quality of a model produced with gradient descent depends on the input parameters, so the typical way to improve it is hyperparameter optimization. Unfortunately the ML version is rather limited here compared to MLlib, but for starters you can increase the number of iterations.
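As a starting point, raising the iteration limit might look like the PySpark sketch below. The helper name and the chosen values are hypothetical; maxIter, tol and regParam are parameters of pyspark.ml.classification.LogisticRegression, and regParam is left at 0.0 here on the assumption that you want a fit comparable to an unregularized glm.

```python
def build_logistic_regression(max_iter=1000, tol=1e-8):
    """Return a Spark ML logistic regression estimator with a generous
    iteration budget and no regularization (hypothetical helper)."""
    # Imported inside the function so the sketch can be loaded without Spark;
    # calling it requires pyspark and an active SparkSession.
    from pyspark.ml.classification import LogisticRegression

    return LogisticRegression(
        featuresCol="features",
        labelCol="label",
        maxIter=max_iter,  # upper bound on optimizer iterations
        tol=tol,           # convergence tolerance; smaller means a longer run
        regParam=0.0,      # no penalty term
    )
```

Usage would be along the lines of `model = build_logistic_regression().fit(train_df)`, where `train_df` is assumed to be a DataFrame with a `features` vector column and a numeric `label` column.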
