pyspark Linear Regression Example from official documentation - Bad results?

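For context, the snippets below assume parsedData is the RDD of LabeledPoints from the documentation example the question refers to, built roughly like this (sc is the shell's SparkContext; the path is the sample file shipped with Spark):

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parsePoint(line):
    # lpsa.data lines look like "label,feature1 feature2 ... feature8"
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)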

For starters you're missing an intercept. While mean values of the independent variables are close to zero:

parsedData.map(lambda lp: lp.features).mean()
## DenseVector([-0.031, -0.0066, 0.1182, -0.0199, 0.0178, -0.0249,
##     -0.0294, 0.0669])

the mean of the dependent variable is pretty far from zero:

parsedData.map(lambda lp: lp.label).mean()
## 2.452345085074627

Forcing the regression line to go through the origin in a case like this doesn't make sense: with roughly centered features, a no-intercept model's predictions average near zero, so its MSE can't drop much below the squared label mean (roughly 2.45² ≈ 6.0). Let's see instead how LinearRegressionWithSGD performs with default arguments and an added intercept:

model = LinearRegressionWithSGD.train(parsedData, intercept=True)
valuesAndPreds = (parsedData.map(lambda p: (p.label, model.predict(p.features))))
valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
## 0.44005904185432504
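If you want a quick sanity check of the fit itself, the returned LinearRegressionModel exposes the learned parameters (exact numbers will vary between SGD runs):

model.weights    # DenseVector with one coefficient per feature
model.intercept  # non-zero now; with roughly centered features it lands near the label mean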

Let's compare it to the analytical solution:

import numpy as np
from sklearn import linear_model

features = np.array(parsedData.map(lambda lp: lp.features.toArray()).collect())
labels = np.array(parsedData.map(lambda lp: lp.label).collect())

lm = linear_model.LinearRegression()
lm.fit(features, labels)
np.mean((lm.predict(features) - labels) ** 2)
## 0.43919976805833411
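You can also line up the parameters directly rather than just the errors; lm.coef_ and lm.intercept_ are sklearn's attributes, and the SGD numbers will differ slightly from run to run:

print(lm.intercept_, model.intercept)  # the two intercepts should be close
print(lm.coef_)                        # least squares coefficients
print(model.weights)                   # SGD coefficients, close but not identical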

As you can see, the results obtained using LinearRegressionWithSGD are almost optimal.

You could add a grid search, but in this particular case there is probably nothing to gain.
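If you wanted to try anyway, a minimal sketch over step size and iteration count (the grid values here are purely illustrative) could look like this:

def mse(m, data):
    # mean squared error of model m over an RDD of LabeledPoints
    return data.map(lambda p: (p.label - m.predict(p.features)) ** 2).mean()

results = [(step, iters,
            mse(LinearRegressionWithSGD.train(parsedData, iterations=iters,
                                              step=step, intercept=True),
                parsedData))
           for step in [0.01, 0.1, 1.0]   # illustrative step sizes
           for iters in [100, 500]]       # illustrative iteration counts
print(min(results, key=lambda t: t[2]))   # best (step, iterations, MSE)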
