Optimal way to create an ML pipeline in Apache Spark for a dataset with a high number of columns

庸人自扰 2020-12-25 10:34

I am working with Spark 2.1.1 on a dataset with ~2000 features and trying to create a basic ML Pipeline, consisting of some Transformers and a Classifier.

Let's assu

2 Answers
  •  既然无缘
    2020-12-25 11:09

    The Janino error you are getting occurs because, as the feature set grows, the code Spark generates for the pipeline becomes larger, eventually exceeding the JVM's 64 KB method-size limit that Janino enforces.

    I'd separate the steps into different pipelines and drop the unnecessary features. Save the intermediate models, such as the fitted StringIndexer and OneHotEncoder stages, and load them at the prediction stage. This also helps because the transformations run faster on the data that has to be predicted.

    Finally, you don't need to keep the original feature columns after the VectorAssembler stage, since it combines them into a single feature vector; that vector and the label column are all you need to run predictions.
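    As a minimal sketch of that last point (the column names "f1", "f2", and "label" are illustrative assumptions, not from the question):

    ```scala
    // Sketch: assemble raw columns into one vector, then keep only
    // the vector and the label. Column names are assumptions.
    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")

    // After assembling, drop the raw feature columns; the feature
    // vector plus the label column is all a classifier needs.
    val prepared = assembler.transform(df).select("features", "label")
    ```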

    Example of a Pipeline in Scala that saves the intermediate steps (older Spark API):
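    The example was not included in the scraped page; a sketch of what it likely looked like follows, using the Spark 2.1-era API. The column names, save paths, and the choice of LogisticRegression are all assumptions for illustration.

    ```scala
    // Hedged sketch: split preprocessing and classification into two
    // pipelines, persist the fitted preprocessing model, and reload it
    // at prediction time instead of re-fitting. Names/paths are assumptions.
    import org.apache.spark.ml.{Pipeline, PipelineModel}
    import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, VectorAssembler}
    import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

    val indexer = new StringIndexer().setInputCol("catCol").setOutputCol("catIndex")
    val encoder = new OneHotEncoder().setInputCol("catIndex").setOutputCol("catVec")
    val assembler = new VectorAssembler()
      .setInputCols(Array("catVec", "num1", "num2"))
      .setOutputCol("features")

    // Fit preprocessing as its own pipeline and save the fitted model.
    val prepPipeline = new Pipeline().setStages(Array(indexer, encoder, assembler))
    val prepModel = prepPipeline.fit(trainDf)
    prepModel.write.overwrite().save("/models/prep")

    // Train the classifier on the assembled features and label only.
    val prepared = prepModel.transform(trainDf).select("features", "label")
    val lrModel = new LogisticRegression().setLabelCol("label").fit(prepared)
    lrModel.write.overwrite().save("/models/lr")

    // At prediction time, reload both saved models instead of re-fitting.
    val loadedPrep = PipelineModel.load("/models/prep")
    val loadedLr = LogisticRegressionModel.load("/models/lr")
    val predictions = loadedLr.transform(loadedPrep.transform(testDf))
    ```

    Keeping the preprocessing pipeline separate means prediction-time data only flows through already-fitted transformers, which is the speedup the answer refers to.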

    Also, if you are using an older Spark version such as 1.6.0, check for a patched release (e.g. 1.6.4, 2.1.1, or 2.2.0); otherwise you will hit the Janino error even with only around 400 feature columns.
