Serialize a custom transformer using python to be used within a Pyspark ML pipeline

醉梦人生 2020-12-01 07:04

I found the same discussion in the comments section of Create a custom Transformer in PySpark ML, but there is no clear answer. There is also an unresolved JIRA ticket corresponding to this.

5 Answers
  •  渐次进展
    2020-12-01 07:25

    Similar to the working answer by @dmbaker, I wrapped my custom transformer, called Aggregator, inside a built-in Spark transformer (Binarizer, in this example), though I'm sure you could inherit from other transformers too. That allowed my custom transformer to inherit the methods necessary for serialization.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, Binarizer
    from pyspark.ml.regression import LinearRegression    
    
    class Aggregator(Binarizer):
        """A huge hack to allow serialization of custom transformer."""
    
        def transform(self, input_df):
            # Average 'foo' and 'bar' per channel; overriding transform()
            # directly bypasses Binarizer's own binarization logic.
            return (input_df
                    .groupBy('channel_id')
                    .agg({'foo': 'avg', 'bar': 'avg'})
                    .withColumnRenamed('avg(foo)', 'avg_foo')
                    .withColumnRenamed('avg(bar)', 'avg_bar'))
    
    # Create pipeline stages.
    aggregator = Aggregator()
    vector_assembler = VectorAssembler(...)
    linear_regression = LinearRegression()
    
    # Create pipeline.
    pipeline = Pipeline(stages=[aggregator, vector_assembler, linear_regression])
    
    # Train.
    pipeline_model = pipeline.fit(input_df)
    
    # Save model file to S3.
    pipeline_model.save('s3n://example')
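
    To reuse the model later, the saved pipeline can be read back with PipelineModel.load. A minimal sketch of the round trip (the path is the one used above; new_df is a placeholder for a DataFrame with the same schema as input_df):

    from pyspark.ml import PipelineModel

    # Reload the fitted pipeline from the same S3 path it was saved to.
    loaded_model = PipelineModel.load('s3n://example')

    # Score a new DataFrame with the same schema as the training input.
    predictions = loaded_model.transform(new_df)

    One caveat: because the stage is persisted through Binarizer's Java-backed writer, the reloaded stage may come back as a plain Binarizer rather than an Aggregator, in which case the overridden transform logic is lost. Test the round trip before relying on it.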
    

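    On Spark 2.3+ there is a cleaner alternative to this hack (not from the original answer, just a sketch): subclass Transformer directly and mix in DefaultParamsReadable and DefaultParamsWritable from pyspark.ml.util, which give a pure-Python transformer native save/load support.

    from pyspark.ml import Transformer
    from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

    class Aggregator(Transformer, DefaultParamsReadable, DefaultParamsWritable):
        """Pure-Python transformer that serializes without wrapping a built-in."""

        def _transform(self, input_df):
            # Same aggregation as above; Transformer.transform() delegates here.
            return (input_df
                    .groupBy('channel_id')
                    .agg({'foo': 'avg', 'bar': 'avg'})
                    .withColumnRenamed('avg(foo)', 'avg_foo')
                    .withColumnRenamed('avg(bar)', 'avg_bar'))

    On load, the reader re-imports the class by its fully qualified name, so the module that defines Aggregator must be importable in the environment where the pipeline is loaded.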