How to save models from ML Pipeline to S3 or HDFS?


One way to save a model to HDFS is as follows:

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/linReg.model")

The saved model can then be loaded back as:

val linRegModel = sc.objectFile[LinearRegressionModel]("hdfs:///user/root/linReg.model").first()

For more details, see (ref).
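The same trick works for S3; as a minimal sketch, assuming the S3A connector is on the classpath and AWS credentials are configured (the bucket and key below are placeholders):

// persist model to S3 via the s3a:// filesystem (bucket/path are placeholders)
sc.parallelize(Seq(model), 1).saveAsObjectFile("s3a://my-bucket/models/linReg.model")

// load it back from the same S3 path
val modelFromS3 = sc.objectFile[LinearRegressionModel]("s3a://my-bucket/models/linReg.model").first()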

Since Apache Spark 1.6, the Scala API lets you save models without any tricks, because every model in the ML library comes with a save method. You can check this on LogisticRegressionModel; it does indeed have that method. To load the model back, use the static load method on the companion object:

import org.apache.spark.ml.classification.LogisticRegressionModel

val logRegModel = LogisticRegressionModel.load("myModel.model")
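The same mechanism covers a whole fitted pipeline, which is usually what you want to persist from an ML Pipeline. A minimal sketch, assuming Spark 1.6+; the stages, paths, and the trainingDF DataFrame are placeholders, not part of the original answer:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// assemble an example pipeline (stages are illustrative)
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit and persist the whole fitted pipeline (an HDFS or S3 path works equally well)
val fitted = pipeline.fit(trainingDF)          // trainingDF: a DataFrame you already have
fitted.save("hdfs:///user/root/myPipeline.model")

// load it back later
val restored = PipelineModel.load("hdfs:///user/root/myPipeline.model")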

FileOutputStream writes to the local filesystem (it does not go through the Hadoop libraries), so with that approach saving to a local directory is the way to go. That said, the target directory needs to exist, so make sure you create it first.
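For illustration, a minimal sketch of that local-filesystem route, assuming the model object is Java-serializable and that /tmp/models is a directory of your choosing (hypothetical):

import java.io.{File, FileOutputStream, ObjectOutputStream}

// make sure the target directory exists before writing
val dir = new File("/tmp/models")
if (!dir.exists()) dir.mkdirs()

// serialize the model to a local file (does not go through the Hadoop libraries)
val out = new ObjectOutputStream(new FileOutputStream(new File(dir, "linReg.model")))
try out.writeObject(model) finally out.close()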

Depending on your model, you may also want to look at PMML export: https://spark.apache.org/docs/latest/mllib-pmml-model-export.html
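As a rough sketch of what PMML export looks like, assuming an mllib model that mixes in PMMLExportable (for example a KMeansModel; the data and paths are placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// train a small mllib model just to have something exportable (data is illustrative)
val points = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0)))
val kMeansModel = KMeans.train(points, 2, 10)

// export to PMML: to a local file, to a distributed path via the SparkContext, or as a String
kMeansModel.toPMML("/tmp/kmeans.xml")
kMeansModel.toPMML(sc, "hdfs:///user/root/kmeans.xml")
println(kMeansModel.toPMML())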
