Run 3000+ Random Forest Models By Group Using Spark MLlib Scala API

Submitted by 一个人想着一个人 on 2019-11-30 05:22:50
zero323

Since you already have a separate DataFrame for each school, there is not much to be done here. Since you are working with DataFrames, I assume you want to use ml.classification.RandomForestClassifier. If so, you can try something like this:

  1. Extract the pipeline logic into a function. Adjust the RandomForestClassifier parameters and transformers according to your requirements.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.{Pipeline, PipelineModel}
    
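    // Builds a one-stage pipeline and fits it to a single school's DataFrame.
    // With default settings, df must contain "features" and "label" columns.
    def trainModel(df: DataFrame): PipelineModel = {
      val rf = new RandomForestClassifier()
      val pipeline = new Pipeline().setStages(Array(rf))
      pipeline.fit(df)
    }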
    
  2. Train models on each subset

    val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))
    
  3. Save models

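    import java.io.{FileOutputStream, ObjectOutputStream}
    
    // Writes a fitted model to /some/path/<name> using Java serialization.
    def saveModel(name: String, model: PipelineModel): Unit = {
      val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
      try oos.writeObject(model) finally oos.close()
    }
    
    // schools is assumed to list the school names in the same order as the models.
    schools.zip(bySchoolArrayModels).foreach {
      case (name, model) => saveModel(name, model)
    }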
    
  4. Optional: since the individual subsets are rather small, you can try an approach similar to the one I've described here to submit multiple tasks at the same time.

  5. If you use mllib.tree.model.RandomForestModel, you can omit step 3 and use model.save directly. Since there seem to be some problems with deserialization (see "How to deserialize Pipeline model in spark.ml?"; as far as I can tell it works just fine, but better safe than sorry, I guess), this could be the preferred approach.
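
One way step 4 could look is sketched below with plain Scala futures. Spark can run jobs concurrently when they are submitted from separate driver threads, so each model can be fitted in its own Future. This is only an assumption about what "submit multiple tasks at the same time" means here, not necessarily the approach from the linked answer. The helper name runConcurrently and its parallelism parameter are hypothetical.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical helper: applies f to every input on a fixed-size thread pool
// and returns the results in the same order as the inputs.
def runConcurrently[A, B](inputs: Seq[A], parallelism: Int)(f: A => B): Seq[B] = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    // Queue all tasks; at most `parallelism` of them run at any moment.
    val futures = inputs.map(x => Future(f(x)))
    // Block until every task has finished; a failed task rethrows here.
    Await.result(Future.sequence(futures), Duration.Inf)
  } finally {
    pool.shutdown()
  }
}
```

With the definitions from steps 1 and 2, this would replace the sequential map with something like `runConcurrently(bySchoolArray.toSeq, 4)(trainModel)`. The pool size of 4 is arbitrary and should be tuned to the cluster.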

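For step 5, a rough sketch of the RDD-based route follows. It assumes the data for one school is already available as an RDD[LabeledPoint] with binary labels and only continuous features. The helper name trainAndSave, the path argument and all hyperparameter values are hypothetical placeholders.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Hypothetical helper: trains an RDD-based random forest and persists it
// with the model's built-in save method instead of Java serialization.
def trainAndSave(sc: SparkContext, data: RDD[LabeledPoint], path: String): RandomForestModel = {
  val model = RandomForest.trainClassifier(
    data,
    2,               // numClasses (assumption: labels are 0.0 or 1.0)
    Map[Int, Int](), // categoricalFeaturesInfo (assumption: all features continuous)
    20,              // numTrees
    "auto",          // featureSubsetStrategy
    "gini",          // impurity
    5,               // maxDepth
    32,              // maxBins
    42)              // seed
  model.save(sc, path)
  model
}

// Loading the model back later:
// val restored = RandomForestModel.load(sc, path)
```
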
Edit

According to the official documentation:

VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.

Since the error indicates your column is a String, you should transform it first, for example using StringIndexer.
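
A minimal sketch of such a pipeline is shown below. The column names ("grade" as the String feature, "age" and "score" as numeric features, "outcome" as the label) are hypothetical, since the actual schema is not shown. The function name trainModelWithIndexer is likewise made up for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Hypothetical variant of trainModel that indexes String columns
// before assembling the feature vector.
def trainModelWithIndexer(df: DataFrame): PipelineModel = {
  // Map the String feature column to numeric category indices.
  val gradeIndexer = new StringIndexer()
    .setInputCol("grade")
    .setOutputCol("gradeIndex")
  // Index the label as well, so the classifier receives label indices 0.0, 1.0, ...
  val labelIndexer = new StringIndexer()
    .setInputCol("outcome")
    .setOutputCol("indexedLabel")
  // VectorAssembler now sees only numeric inputs.
  val assembler = new VectorAssembler()
    .setInputCols(Array("gradeIndex", "age", "score"))
    .setOutputCol("features")
  val rf = new RandomForestClassifier()
    .setLabelCol("indexedLabel")
    .setFeaturesCol("features")
  new Pipeline()
    .setStages(Array(gradeIndexer, labelIndexer, assembler, rf))
    .fit(df)
}
```

Note that the indexed column is treated as categorical, so a String column with more distinct values than maxBins (32 by default) would likely require raising maxBins on the classifier.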
