Optimal way to create an ML pipeline in Apache Spark for a dataset with a high number of columns

庸人自扰 2020-12-25 10:34

I am working with Spark 2.1.1 on a dataset with ~2000 features and trying to create a basic ML Pipeline, consisting of some Transformers and a Classifier.

Let's assu

2 Answers
  •  既然无缘
    2020-12-25 11:09

    The Janino error you are getting occurs because, as the feature set grows, the code Spark generates for the pipeline becomes larger, eventually exceeding the JVM's 64 KB method-size limit that Janino enforces.

    I'd separate the steps into different pipelines and drop the unnecessary features. Save the intermediate models, such as the fitted StringIndexer and OneHotEncoder stages, and load them at the prediction stage. This also helps because the transformations run faster on the data that has to be predicted.

    Finally, you don't need to keep the original feature columns after the VectorAssembler stage, since it combines them into a single feature vector; that vector and the label column are all you need to run predictions.
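    As a minimal sketch of that last point (the column names "f1", "f2", and "label" are illustrative assumptions, not from the question):

    ```scala
    // Sketch: assemble raw columns into one vector, then keep only
    // the vector and the label. Column names are assumptions.
    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")

    // After assembling, drop the raw feature columns; the feature
    // vector plus the label column is all a classifier needs.
    val prepared = assembler.transform(df).select("features", "label")
    ```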

    Example of a Pipeline in Scala that saves the intermediate steps (older Spark API):
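    The example was not included in the scraped page; a sketch of what it likely looked like follows, using the Spark 2.1-era API. The column names, save paths, and the choice of LogisticRegression are all assumptions for illustration.

    ```scala
    // Hedged sketch: split preprocessing and classification into two
    // pipelines, persist the fitted preprocessing model, and reload it
    // at prediction time instead of re-fitting. Names/paths are assumptions.
    import org.apache.spark.ml.{Pipeline, PipelineModel}
    import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder, VectorAssembler}
    import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

    val indexer = new StringIndexer().setInputCol("catCol").setOutputCol("catIndex")
    val encoder = new OneHotEncoder().setInputCol("catIndex").setOutputCol("catVec")
    val assembler = new VectorAssembler()
      .setInputCols(Array("catVec", "num1", "num2"))
      .setOutputCol("features")

    // Fit preprocessing as its own pipeline and save the fitted model.
    val prepPipeline = new Pipeline().setStages(Array(indexer, encoder, assembler))
    val prepModel = prepPipeline.fit(trainDf)
    prepModel.write.overwrite().save("/models/prep")

    // Train the classifier on the assembled features and label only.
    val prepared = prepModel.transform(trainDf).select("features", "label")
    val lrModel = new LogisticRegression().setLabelCol("label").fit(prepared)
    lrModel.write.overwrite().save("/models/lr")

    // At prediction time, reload both saved models instead of re-fitting.
    val loadedPrep = PipelineModel.load("/models/prep")
    val loadedLr = LogisticRegressionModel.load("/models/lr")
    val predictions = loadedLr.transform(loadedPrep.transform(testDf))
    ```

    Keeping the preprocessing pipeline separate means prediction-time data only flows through already-fitted transformers, which is the speedup the answer refers to.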

    Also, if you are using an older Spark version such as 1.6.0, check for a patched release (e.g. 1.6.4, 2.1.1, or 2.2.0); otherwise you will hit the Janino error even with only around 400 feature columns.
