I am working with Spark 2.1.1 on a dataset with ~2000 features and trying to create a basic ML Pipeline, consisting of some Transformers and a Classifier.
Let's assume …
The Janino error that you are getting occurs because the code Spark generates at runtime grows with the size of the feature set, and with that many features it exceeds the JVM's bytecode limits (such as the 64 KB method-size limit).
I'd separate the steps into different pipelines, drop the unnecessary features, and save the intermediate models such as the fitted StringIndexer and OneHotEncoder stages so that you can load them at prediction time. This also helps because the transformations of the data to be predicted will be faster.
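For example, a fitted StringIndexerModel saved during training can be reloaded at prediction time instead of being refit; a minimal sketch, where the save path, column names, and `newDataDF` are placeholders:

```scala
import org.apache.spark.ml.feature.StringIndexerModel

// Reload the indexer that was fitted and saved during training
// ("/models/indexer" is a hypothetical path).
val indexerModel = StringIndexerModel.load("/models/indexer")

// Apply the same transformation to the new data that has to be predicted.
val indexedNewData = indexerModel.transform(newDataDF)
```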
Finally, you don't need to keep the feature columns after the VectorAssembler stage, since it transforms the features into a single feature vector; together with the label column, that is all you need to run predictions.
Example of a pipeline in Scala that saves the intermediate steps (older Spark API):
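(A minimal sketch of such a pipeline; the column names, the "label" column, the save path, and `trainDF` are placeholders for illustration.)

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Placeholder column names for illustration.
val categoricalCol = "category"
val numericCols    = Array("num1", "num2")

// Step 1: fit the StringIndexer separately and persist the fitted model,
// so it can be reloaded at prediction time instead of being refit.
val indexerModel = new StringIndexer()
  .setInputCol(categoricalCol)
  .setOutputCol(categoricalCol + "_idx")
  .fit(trainDF)
indexerModel.write.overwrite().save("/models/indexer")
val indexedDF = indexerModel.transform(trainDF)

// Step 2: one-hot encode the indexed column
// (the pre-2.3 OneHotEncoder is a plain Transformer, so there is nothing to fit).
val encoder = new OneHotEncoder()
  .setInputCol(categoricalCol + "_idx")
  .setOutputCol(categoricalCol + "_vec")
val encodedDF = encoder.transform(indexedDF)

// Step 3: assemble everything into a single feature vector and keep only
// the columns needed for training, dropping the raw feature columns.
val assembler = new VectorAssembler()
  .setInputCols(numericCols :+ (categoricalCol + "_vec"))
  .setOutputCol("features")
val trainingData = assembler.transform(encodedDF).select("features", "label")
```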
Also, if you are using an older Spark version such as 1.6.0, make sure you move to a patched release (e.g. 1.6.4, 2.1.1, or 2.2.0); otherwise you would hit the Janino error even with only around 400 feature columns.