Spark MLlib linear regression (linear least squares) giving random results

Asked 2020-12-07 02:22

I'm new to Spark and machine learning in general. I have successfully followed some of the MLlib tutorials, but I can't get this one working:

I found the sample code here

2 answers
  • 2020-12-07 03:01

    As explained by zero323 here, setting the intercept to true will solve the problem. If it is not set to true, your regression line is forced to go through the origin, which is not appropriate in this case. (Not sure why this is not included in the sample code.)

    So, to fix your problem, change the following line in your code (PySpark):

    model = LinearRegressionWithSGD.train(parsedData, numIterations)
    

    to

    model = LinearRegressionWithSGD.train(parsedData, numIterations, intercept=True)
    

    Although not mentioned explicitly, this is also why the code from 'selvinsource' in the above question is working. Changing the step size doesn't help much in this example.
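
    For completeness, here is a minimal end-to-end PySpark sketch of the fix, assuming the same lpsa.data file from the MLlib example and an existing `sc` SparkContext (as in pyspark/spark-shell); the `parse_point` helper is just for illustration, while `iterations`, `step` and `intercept` are keyword arguments of `LinearRegressionWithSGD.train` in the RDD-based MLlib API:

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    # Parse each line "label,feat1 feat2 ..." into a LabeledPoint
    def parse_point(line):
        label, features = line.split(',')
        return LabeledPoint(float(label), [float(x) for x in features.split(' ')])

    data = sc.textFile("data/mllib/ridge-data/lpsa.data")
    parsed_data = data.map(parse_point).cache()

    # Fit an intercept so the regression line is not forced through the origin;
    # a smaller step size (0.1) also helps SGD converge on this dataset
    model = LinearRegressionWithSGD.train(parsed_data, iterations=100,
                                          step=0.1, intercept=True)

    # Training mean squared error
    values_and_preds = parsed_data.map(lambda p: (p.label, model.predict(p.features)))
    mse = values_and_preds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
    print("training Mean Squared Error = " + str(mse))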

  • 2020-12-07 03:19

    LinearRegressionWithSGD is SGD-based and requires tweaking the step size; see http://spark.apache.org/docs/latest/mllib-optimization.html for more details. With the default step size of 1.0, SGD can overshoot on this dataset and fail to converge, which is why the results look random.

    In your example, if you set the step size to 0.1 you get better results (MSE = 0.5).

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.regression.LinearRegressionModel
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    
    // Load and parse the data
    val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()
    
    // Build the model
    var regression = new LinearRegressionWithSGD().setIntercept(true)
    regression.optimizer.setStepSize(0.1)
    val model = regression.run(parsedData)
    
    // Evaluate model on training examples and compute training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
    println("training Mean Squared Error = " + MSE)
    

    For another example using a more realistic dataset, see:

    https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/datasets/winequalityred_linearregression.md

    https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/spark_shell_exporter/linearregression_winequalityred.scala
