PySpark transform method that's equivalent to the Scala Dataset#transform method

随声附和 posted on 2020-06-27 17:49:05

Question


The Spark Scala API has a Dataset#transform method that makes it easy to chain custom DataFrame transformations like so:

val weirdDf = df
  .transform(myFirstCustomTransformation)
  .transform(anotherCustomTransformation)

I don't see an equivalent transform method for PySpark in the documentation.

Is there a PySpark way to chain custom transformations?

If not, how can the pyspark.sql.DataFrame class be monkey patched to add a transform method?

Update

A DataFrame.transform method was added in PySpark 3.0, so no monkey patching is needed on that version or later.
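
As a minimal sketch of what chaining looks like with the built-in method, assuming PySpark 3.0 or later and an already active SparkSession passed in as spark: the names with_doubled_id, with_scaled_id, and demo are hypothetical and chosen only for illustration. A transformation that needs a parameter can be written as a function that returns the actual DataFrame-to-DataFrame function.

```python
def with_doubled_id(df):
    """Hypothetical transformation: keep all columns and append id * 2."""
    return df.selectExpr("*", "id * 2 AS doubled_id")


def with_scaled_id(factor):
    """Hypothetical parameterized transformation.

    Returns a function that takes a DataFrame, so the result can be
    handed straight to DataFrame.transform.
    """
    def inner(df):
        return df.selectExpr("*", "id * {} AS scaled_id".format(factor))
    return inner


def demo(spark):
    """Chain both transformations; `spark` is assumed to be a SparkSession."""
    return (
        spark.range(3)
        .transform(with_doubled_id)
        .transform(with_scaled_id(10))
    )
```

In a PySpark shell, demo(spark).show() should display the id column followed by the two appended columns.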


Answer 1:


Implementation:

from pyspark.sql.dataframe import DataFrame

def transform(self, f):
    return f(self)

DataFrame.transform = transform

Usage:

spark.range(1).transform(lambda df: df.selectExpr("id * 2"))
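
Because PySpark 3.0 and later ship their own transform, an unconditional patch would overwrite the built-in method. A minimal sketch of a guarded variant follows; the helper name ensure_transform is hypothetical and not part of PySpark, and it patches a class only when the method is missing.

```python
def transform(self, f):
    """Apply f to this object and return the result, so calls can be chained."""
    return f(self)


def ensure_transform(cls):
    """Attach `transform` to cls only if cls does not already provide one."""
    if not hasattr(cls, "transform"):
        cls.transform = transform
    return cls


# Intended use (requires PySpark to be installed):
#   from pyspark.sql.dataframe import DataFrame
#   ensure_transform(DataFrame)
```

With this guard the same code runs on older versions, where the patch is applied, and on newer ones, where the built-in method is left untouched.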



Answer 2:


A Pipeline of SQLTransformer objects (or any other Transformer) is a built-in Spark way to chain transformations. For example:

from pyspark.ml.feature import SQLTransformer
from pyspark.ml import Pipeline

df = spark.createDataFrame([
    (0, 1.0, 3.0),
    (2, 2.0, 5.0)
], ["id", "v1", "v2"])
sqlTrans = SQLTransformer(
    statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

sqlSelectExpr = SQLTransformer(statement="SELECT *, (id * 2) AS v5 FROM __THIS__")

pipeline = Pipeline(stages=[sqlTrans, sqlSelectExpr])
pipelineModel = pipeline.fit(df)
pipelineModel.transform(df).show()

When all the transformations are simple expressions like the ones above, another approach is to build the statement for a single SQLTransformer with string manipulation:

transforms = ['(v1 + v2) AS v3',
              '(v1 * v2) AS v4',
              '(id * 2) AS v5',
              ]
selectExpr = "SELECT *, {} FROM __THIS__".format(",".join(transforms))
sqlSelectExpr = SQLTransformer(statement=selectExpr)
sqlSelectExpr.transform(df).show()
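
The string-building step above can be wrapped in a small helper so that an empty list of expressions still yields a valid statement. This is only a sketch: build_select_statement is a hypothetical name, and __THIS__ is the placeholder SQLTransformer uses for the input table.

```python
def build_select_statement(transforms):
    """Build a statement that keeps all columns and appends the given expressions."""
    columns = ["*"] + list(transforms)
    return "SELECT {} FROM __THIS__".format(", ".join(columns))


statement = build_select_statement([
    "(v1 + v2) AS v3",
    "(v1 * v2) AS v4",
    "(id * 2) AS v5",
])
print(statement)
# prints: SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4, (id * 2) AS v5 FROM __THIS__
```

The resulting string can then be passed as SQLTransformer(statement=statement).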

Keep in mind that Spark SQL expressions can be optimized by Spark's query optimizer and are generally faster than transformations defined as Python user-defined functions (UDFs).



Source: https://stackoverflow.com/questions/46247315/pyspark-transform-method-thats-equivalent-to-the-scala-datasettransform-method
