pyspark.ml pipelines: are custom transformers necessary for basic preprocessing tasks?
Getting started with pyspark.ml and the pipelines API, I find myself writing custom transformers for typical preprocessing tasks in order to use them in a pipeline. Examples:

```python
from pyspark.ml import Pipeline, Transformer


class CustomTransformer(Transformer):
    # lazy workaround - a transformer needs to have these attributes
    _defaultParamMap = dict()
    _paramMap = dict()
    _params = dict()


class ColumnSelector(CustomTransformer):
    """Transformer that selects a subset of columns - to be used as pipeline stage"""

    def __init__(self, columns):
        self.columns = columns

    def _transform(self, data):
        # keep only the requested subset of columns
        return data.select(self.columns)
```
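For completeness, this is roughly how I wire such a stage into a pipeline; the DataFrame and column names below are made up for illustration:

```python
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# illustrative data - the column names are arbitrary
df = spark.createDataFrame(
    [(1, "a", 2.0), (2, "b", 3.0)],
    ["id", "category", "value"],
)

# the custom transformer is used like any other pipeline stage
pipeline = Pipeline(stages=[ColumnSelector(columns=["id", "value"])])
pipeline.fit(df).transform(df).show()
```

Since there is no estimator among the stages, `fit()` just passes the transformer through into the resulting `PipelineModel`.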