Getting started with pyspark.ml and the pipelines API, I find myself writing custom transformers for typical preprocessing tasks in order to use them in a pipeline. Examples:
from pyspark.ml import Pipeline, Transformer


class CustomTransformer(Transformer):
    # lazy workaround - a transformer needs to have these attributes
    _defaultParamMap = dict()
    _paramMap = dict()
    _params = dict()


class ColumnSelector(CustomTransformer):
    """Transformer that selects a subset of columns - to be used as a pipeline stage."""

    def __init__(self, columns):
        self.columns = columns

    def _transform(self, data):
        return data.select(self.columns)


class ColumnRenamer(CustomTransformer):
    """Transformer that renames one column."""

    def __init__(self, rename):
        self.rename = rename

    def _transform(self, data):
        (colNameBefore, colNameAfter) = self.rename
        return data.withColumnRenamed(colNameBefore, colNameAfter)


class NaDropper(CustomTransformer):
    """Drops rows with at least one null/not-a-number element."""

    def __init__(self, cols=None):
        self.cols = cols

    def _transform(self, data):
        dataAfterDrop = data.dropna(subset=self.cols)
        return dataAfterDrop


class ColumnCaster(CustomTransformer):
    """Transformer that casts one column to a given type."""

    def __init__(self, col, toType):
        self.col = col
        self.toType = toType

    def _transform(self, data):
        return data.withColumn(self.col, data[self.col].cast(self.toType))
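For context, the intended use is to chain them as stages of a Pipeline, roughly like this (df and the column names are just placeholders):

pipeline = Pipeline(stages=[
    ColumnSelector(["id", "age"]),
    ColumnRenamer(("age", "age_years")),
    NaDropper(["age_years"]),
    ColumnCaster("age_years", "double"),
])
result = pipeline.fit(df).transform(df)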
They work, but I was wondering if this is a pattern or antipattern - are such transformers a good way to work with the pipeline API? Was it necessary to implement them, or is equivalent functionality provided somewhere else?
I'd say it is primarily opinion-based, although it looks unnecessarily verbose and plain Python Transformers don't integrate well with the rest of the Pipeline API.
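To illustrate what integrating means here: a custom stage is usually declared with Params and the persistence mixins, so that it supports copy(), explainParams() and save()/load() like the built-in stages. A rough sketch (the class and param names below are made up, not part of the question):

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable


class ParamColumnSelector(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    """Column selector declared via Params, so the stage participates in
    copy(), param introspection and pipeline persistence."""

    columns = Param(
        Params._dummy(), "columns", "columns to keep",
        typeConverter=TypeConverters.toListString,
    )

    @keyword_only
    def __init__(self, columns=None):
        super().__init__()
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def _transform(self, dataset):
        return dataset.select(self.getOrDefault(self.columns))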
It is also worth pointing out that everything you have here can be easily achieved with SQLTransformer. For example:
from pyspark.ml.feature import SQLTransformer


def column_selector(columns):
    return SQLTransformer(
        statement="SELECT {} FROM __THIS__".format(", ".join(columns))
    )
or
def na_dropper(columns):
    return SQLTransformer(
        statement="SELECT * FROM __THIS__ WHERE {}".format(
            " AND ".join(["{} IS NOT NULL".format(x) for x in columns])
        )
    )
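Such factory functions can then be dropped straight into a Pipeline; a minimal sketch (df and the column names are placeholders):

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
    column_selector(["id", "age", "income"]),
    na_dropper(["age", "income"]),
])
result = pipeline.fit(df).transform(df)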
With a little bit of effort you can use SQLAlchemy with the Hive dialect to avoid handwritten SQL.
Source: https://stackoverflow.com/questions/49734374/pyspark-ml-pipelines-are-custom-transformers-necessary-for-basic-preprocessing