I found the same discussion in the comments section of Create a custom Transformer in PySpark ML, but there is no clear answer there. There is also an unresolved JIRA corresponding to this issue.
Similar to the working answer by @dmbaker, I wrapped my custom transformer, called Aggregator, inside a built-in Spark transformer — Binarizer in this example — though I'm sure you could inherit from other transformers, too. That allowed my custom transformer to inherit the methods necessary for serialization.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, Binarizer
from pyspark.ml.regression import LinearRegression
class Aggregator(Binarizer):
    """A huge hack to allow serialization of a custom transformer."""

    def transform(self, input_df):
        # Average the foo and bar columns per channel.
        agg_df = input_df \
            .groupBy('channel_id') \
            .agg({
                'foo': 'avg',
                'bar': 'avg',
            }) \
            .withColumnRenamed('avg(foo)', 'avg_foo') \
            .withColumnRenamed('avg(bar)', 'avg_bar')
        return agg_df
# Create pipeline stages.
aggregator = Aggregator()
vector_assembler = VectorAssembler(...)
linear_regression = LinearRegression()
# Create pipeline.
pipeline = Pipeline(stages=[aggregator, vector_assembler, linear_regression])
# Train.
pipeline_model = pipeline.fit(input_df)
# Save model file to S3.
pipeline_model.save('s3n://example')
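For clarity, the aggregation inside Aggregator.transform just computes per-channel averages of foo and bar. A plain-Python sketch of the same logic (the sample rows here are hypothetical, just to illustrate the shape of the data) looks like this:

```python
from collections import defaultdict

# Hypothetical sample rows mirroring the input DataFrame's columns.
rows = [
    {'channel_id': 1, 'foo': 2.0, 'bar': 10.0},
    {'channel_id': 1, 'foo': 4.0, 'bar': 20.0},
    {'channel_id': 2, 'foo': 6.0, 'bar': 30.0},
]

# Group values by channel_id, then average each column, mirroring
# groupBy('channel_id').agg({'foo': 'avg', 'bar': 'avg'}).
groups = defaultdict(lambda: {'foo': [], 'bar': []})
for row in rows:
    groups[row['channel_id']]['foo'].append(row['foo'])
    groups[row['channel_id']]['bar'].append(row['bar'])

agg = {
    channel: {
        'avg_foo': sum(vals['foo']) / len(vals['foo']),
        'avg_bar': sum(vals['bar']) / len(vals['bar']),
    }
    for channel, vals in groups.items()
}
print(agg)  # {1: {'avg_foo': 3.0, 'avg_bar': 15.0}, 2: {'avg_foo': 6.0, 'avg_bar': 30.0}}
```

The renamed avg_foo and avg_bar columns are what the downstream VectorAssembler stage would pick up as features.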