I found the same discussion in the comments section of Create a custom Transformer in PySpark ML, but there is no clear answer there. There is also an unresolved JIRA corresponding to this issue.
Similar to the working answer by @dmbaker, I wrapped my custom transformer, called Aggregator, inside a built-in Spark transformer — Binarizer in this example — though I'm sure you could inherit from other transformers, too. That allowed my custom transformer to inherit the methods necessary for serialization.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, Binarizer
from pyspark.ml.regression import LinearRegression
class Aggregator(Binarizer):
    """A huge hack to allow serialization of a custom transformer."""

    def transform(self, input_df):
        # Average the foo and bar columns per channel.
        agg_df = input_df \
            .groupBy('channel_id') \
            .agg({
                'foo': 'avg',
                'bar': 'avg',
            }) \
            .withColumnRenamed('avg(foo)', 'avg_foo') \
            .withColumnRenamed('avg(bar)', 'avg_bar')
        return agg_df
# Create pipeline stages.
aggregator = Aggregator()
vector_assembler = VectorAssembler(...)
linear_regression = LinearRegression()
# Create pipeline.
pipeline = Pipeline(stages=[aggregator, vector_assembler, linear_regression])
# Train.
pipeline_model = pipeline.fit(input_df)
# Save model file to S3.
pipeline_model.save('s3n://example')
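For clarity, the aggregation inside Aggregator.transform just computes per-channel averages of foo and bar. A plain-Python sketch of the same logic (the sample rows here are hypothetical, just to illustrate the shape of the data) looks like this:

```python
from collections import defaultdict

# Hypothetical sample rows mirroring the input DataFrame's columns.
rows = [
    {'channel_id': 1, 'foo': 2.0, 'bar': 10.0},
    {'channel_id': 1, 'foo': 4.0, 'bar': 20.0},
    {'channel_id': 2, 'foo': 6.0, 'bar': 30.0},
]

# Group values by channel_id, then average each column, mirroring
# groupBy('channel_id').agg({'foo': 'avg', 'bar': 'avg'}).
groups = defaultdict(lambda: {'foo': [], 'bar': []})
for row in rows:
    groups[row['channel_id']]['foo'].append(row['foo'])
    groups[row['channel_id']]['bar'].append(row['bar'])

agg = {
    channel: {
        'avg_foo': sum(vals['foo']) / len(vals['foo']),
        'avg_bar': sum(vals['bar']) / len(vals['bar']),
    }
    for channel, vals in groups.items()
}
print(agg)  # {1: {'avg_foo': 3.0, 'avg_bar': 15.0}, 2: {'avg_foo': 6.0, 'avg_bar': 30.0}}
```

The renamed avg_foo and avg_bar columns are what the downstream VectorAssembler stage would pick up as features.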