I'm new to Spark and to machine learning in general. I have successfully followed some of the MLlib tutorials, but I can't get this one working:
I found the sample code here
As explained by zero323 here, setting the intercept to true solves the problem. If it is not set to true, the regression line is forced to go through the origin, which is not appropriate in this case. (I'm not sure why this is not included in the sample code.)
So, to fix your problem, change the following line in your code (PySpark):

    model = LinearRegressionWithSGD.train(parsedData, numIterations)

to

    model = LinearRegressionWithSGD.train(parsedData, numIterations, intercept=True)

Although not mentioned explicitly, this is also why the code from 'selvinsource' in the above question is working. Changing the step size doesn't help much in this example.
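To illustrate why a missing intercept hurts, here is a minimal sketch in plain Python (no Spark involved). The toy data and the helper names `fit_through_origin`, `fit_with_intercept`, and `mse` are made up for this illustration; it uses closed-form least squares rather than SGD, so it only shows the effect of the constraint, not how MLlib fits the model.

```python
# Toy data (hypothetical): y = 2x + 5, so the true intercept is far from 0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 5.0 for x in xs]


def fit_through_origin(xs, ys):
    """Least-squares slope when the line is forced through the origin."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return slope, 0.0


def fit_with_intercept(xs, ys):
    """Ordinary least squares with a free intercept term."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) * (x - mean_x) for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the line on the given points."""
    errors = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(xs)


slope0, b0 = fit_through_origin(xs, ys)
slope1, b1 = fit_with_intercept(xs, ys)
mse_origin = mse(xs, ys, slope0, b0)
mse_intercept = mse(xs, ys, slope1, b1)

print("through origin: slope=%.3f intercept=%.3f MSE=%.3f" % (slope0, b0, mse_origin))
print("with intercept: slope=%.3f intercept=%.3f MSE=%.3f" % (slope1, b1, mse_intercept))
```

On this toy data the constrained fit has to pick a slope that is too steep to compensate for the missing offset, so its error stays well above zero, while the fit with an intercept recovers the line exactly.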
LinearRegressionWithSGD is based on stochastic gradient descent (SGD) and requires tuning the step size; see http://spark.apache.org/docs/latest/mllib-optimization.html for more details.
In your example, if you set the step size to 0.1 you get better results (MSE = 0.5).

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.regression.LinearRegressionModel
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse the data
    val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    // Build the model
    val regression = new LinearRegressionWithSGD().setIntercept(true)
    regression.optimizer.setStepSize(0.1)
    val model = regression.run(parsedData)

    // Evaluate model on training examples and compute training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
    println("training Mean Squared Error = " + MSE)

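To see why the step size matters for gradient-based fitting in general, here is a small sketch in plain Python. It is ordinary batch gradient descent on a made-up one-feature dataset, not MLlib's actual optimizer, and the function name `gradient_descent` and the chosen step sizes are assumptions for illustration only.

```python
def gradient_descent(xs, ys, step_size, num_iterations):
    """Plain batch gradient descent on mean squared error for y ~ w*x + b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(num_iterations):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= step_size * grad_w
        b -= step_size * grad_b
    final_errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    final_mse = sum(e * e for e in final_errors) / n
    return w, b, final_mse


# Toy data (hypothetical): y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

# Same number of iterations, three different step sizes.
results = {}
for step_size in (0.01, 0.1, 1.0):
    w, b, final_mse = gradient_descent(xs, ys, step_size, 100)
    results[step_size] = final_mse
    print("step=%s  w=%.4g  b=%.4g  MSE=%.4g" % (step_size, w, b, final_mse))
```

On this toy problem the smallest step is still converging after 100 iterations, the middle one gets very close to the true line, and the largest one overshoots on every update and diverges. The right value depends on the scale of the features, which is why it usually has to be tuned per dataset.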
For another example on a more realistic dataset, see
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/datasets/winequalityred_linearregression.md
https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/spark_shell_exporter/linearregression_winequalityred.scala