Spark MLlib linear regression (linear least squares) giving random results


I'm new to Spark and to machine learning in general. I have successfully followed some of the MLlib tutorials, but I can't get this one working:

I found the sample code here


2 Answers

余生分开走

2020-12-07 03:01
As explained by zero323 here, setting the intercept to true solves the problem. If it is not set to true, the regression line is forced to go through the origin, which is not appropriate in this case. (I'm not sure why this is not included in the sample code.)

So, to fix your problem, change the following line in your code (PySpark):
```
model = LinearRegressionWithSGD.train(parsedData, numIterations)
```
to
```
model = LinearRegressionWithSGD.train(parsedData, numIterations, intercept=True)
```
Although not mentioned explicitly, this is also why the code from 'selvinsource' in the above question is working. Changing the step size doesn't help much in this example.
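To see why a missing intercept hurts, here is a minimal sketch in plain Python (no Spark involved). The toy data and the helper names `fit_through_origin`, `fit_with_intercept`, and `mse` are made up for illustration and have nothing to do with the lpsa dataset or the MLlib API. It fits the same points once with the line forced through the origin and once with an intercept:

```python
# Hypothetical toy data generated from y = 2x + 5 (not the lpsa dataset).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 5.0 for x in xs]


def fit_through_origin(xs, ys):
    """Least-squares slope when the line is forced through (0, 0)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_with_intercept(xs, ys):
    """Ordinary least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept=0.0):
    """Mean squared error of the line slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2
               for x, y in zip(xs, ys)) / len(xs)


slope_origin = fit_through_origin(xs, ys)
slope_full, intercept_full = fit_with_intercept(xs, ys)
mse_origin = mse(xs, ys, slope_origin)
mse_full = mse(xs, ys, slope_full, intercept_full)

print("through origin: slope =", slope_origin, "MSE =", mse_origin)
print("with intercept: slope =", slope_full,
      "intercept =", intercept_full, "MSE =", mse_full)
```

On this data the fit without an intercept cannot recover the offset of 5, so it compensates with a steeper slope and leaves a sizeable error, while the fit with an intercept recovers the generating line.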

孤街浪徒

2020-12-07 03:19

LinearRegressionWithSGD is SGD-based and requires tuning the step size; see http://spark.apache.org/docs/latest/mllib-optimization.html for more details.

In your example, if you set the step size to 0.1 you get better results (MSE = 0.5).

```
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Build the model (with an intercept and a step size of 0.1)
val regression = new LinearRegressionWithSGD().setIntercept(true)
regression.optimizer.setStepSize(0.1)
val model = regression.run(parsedData)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println("training Mean Squared Error = " + MSE)
```
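The sensitivity to step size can be illustrated with a small plain-Python sketch. It is only an illustration under stated assumptions: it uses full-batch gradient descent with a constant step rather than MLlib's optimizer, the toy data is made up, and `gradient_descent_mse` is a hypothetical helper, not a Spark function. It runs the same number of iterations with three step sizes and reports the final training MSE:

```python
def gradient_descent_mse(xs, ys, step_size, num_iterations):
    """Full-batch gradient descent for y ~ w * x + b; returns the final training MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(num_iterations):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= step_size * grad_w
        b -= step_size * grad_b
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / n


# Hypothetical toy data generated from y = 2x + 1 (not the lpsa dataset).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.0 + 2.0 * x for x in xs]

mse_too_small = gradient_descent_mse(xs, ys, step_size=0.001, num_iterations=100)
mse_reasonable = gradient_descent_mse(xs, ys, step_size=0.1, num_iterations=100)
mse_too_large = gradient_descent_mse(xs, ys, step_size=1.0, num_iterations=100)

print("step 0.001:", mse_too_small)
print("step 0.1:  ", mse_reasonable)
print("step 1.0:  ", mse_too_large)
```

With the tiny step the model has barely moved after 100 iterations, with the moderate step it gets close to the generating line, and with the large step the updates overshoot and the error blows up. Which step size counts as "moderate" depends on the scale of the features, so the values here should not be carried over to other datasets.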

For another example on a more realistic dataset, see

https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/datasets/winequalityred_linearregression.md

https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/spark_shell_exporter/linearregression_winequalityred.scala
