How to create a Spark DataFrame inside a custom PySpark ML Pipeline _transform() method?

Submitted by 隐身守侯 on 2019-12-11 16:25:32

Question


In Spark's ML Pipelines the transformer's transform() method takes a Spark DataFrame and returns a DataFrame. My custom _transform() method uses the DataFrame that's passed in to create an RDD before processing it. This means the results of my algorithm have to be converted back into a DataFrame before being returned from _transform().

So how should I create the DataFrame from the RDD inside _transform()?

Normally I would use SparkSession.createDataFrame(). But that means somehow passing a SparkSession instance (or a SQLContext object) into my custom Transformer, which in turn creates other problems, such as when trying to use the transformer as a stage in an ML Pipeline.


Answer 1:


It turns out it's as simple as doing this inside _transform():

yourRdd.toDF(yourSchema)

The schema is optional. toDF() doesn't appear under https://spark.apache.org/docs/2.2.0/api/python/pyspark.html#pyspark.RDD because it isn't defined on the RDD class itself: pyspark.sql attaches toDF() to RDD at runtime once a SparkSession (or SQLContext) has been created, so it is documented with the pyspark.sql API rather than with the core RDD API.

I also previously tested passing a SparkSession object into my Transformer and calling createDataFrame() on it. That works, but it's unnecessary.



Source: https://stackoverflow.com/questions/48643152/how-to-create-a-spark-dataframe-inside-a-custom-pyspark-ml-pipeline-transform
