Question
How can I use foreach in Python Spark Structured Streaming to trigger operations on the output?
def func(row):
    # foreach calls this once per output row
    ops(row)

query = wordCounts \
    .writeStream \
    .outputMode('update') \
    .foreach(func) \
    .start()
Answer 1:
TL;DR: It is not possible to use the foreach method in pyspark.
Quoting the official Spark Structured Streaming documentation (emphasis mine):
"The foreach operation allows arbitrary operations to be computed on the output data. As of Spark 2.1, this is available only for Scala and Java."
Answer 2:
Support for the foreach sink in Python was added in Spark 2.4.0, and the documentation has been updated: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
Make sure you are on that version; then you can do:
def process_row(row):
    # Process the row
    pass

query = streamingDF.writeStream.foreach(process_row).start()
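Besides a plain function, the Python foreach sink in Spark 2.4+ also accepts an object exposing open/process/close methods, which is useful when you need per-partition setup and teardown (e.g. opening a connection). A minimal sketch of that shape follows; the class name RowCollector and the in-memory `rows` list are assumptions standing in for a real external sink:

```python
class RowCollector:
    """Illustrative writer object for DataStreamWriter.foreach (assumed name)."""

    def __init__(self):
        # Stand-in for a real sink such as a database connection.
        self.rows = []

    def open(self, partition_id, epoch_id):
        # Called once per partition per epoch; return True to process it.
        return True

    def process(self, row):
        # Called once for each row of the streaming output.
        self.rows.append(row)

    def close(self, error):
        # Called when the partition finishes (error is None on success).
        pass

# In a real job:
# query = streamingDF.writeStream.foreach(RowCollector()).start()
```

Note that Spark invokes these methods on the executors, so the object must be picklable and any connections should be created inside open rather than __init__.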
Answer 3:
It's impossible to use foreach in pyspark with any simple trick right now. Besides, in pyspark the update output mode is only ready for debugging.
I'd recommend you use Spark in Scala; it's not hard to learn.
Answer 4:
You can use DataFrame.foreach(f) instead.
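In a streaming job, DataFrame.foreach only becomes available once you have a regular DataFrame in hand, which is what foreachBatch (also added in Spark 2.4) provides for each micro-batch. A minimal sketch of that pattern, where the names `seen`, `record_row`, and `process_batch` are assumptions for illustration:

```python
# Driver-local list used only for this local illustration; in a real
# cluster, record_row runs on the executors, so appending to a
# driver-side list would NOT propagate back to the driver.
seen = []

def record_row(row):
    # Called once per Row by DataFrame.foreach.
    seen.append(row)

def process_batch(batch_df, epoch_id):
    # foreachBatch hands each micro-batch over as a regular DataFrame,
    # so batch-level APIs such as DataFrame.foreach apply here.
    batch_df.foreach(record_row)

# In a real job:
# query = streamingDF.writeStream.foreachBatch(process_batch).start()
```

For real side effects (writing to a database, calling an external service), put the work inside the per-row or per-batch function itself rather than mutating driver-side state.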
Source: https://stackoverflow.com/questions/48201647/how-to-use-foreach-sink-in-pyspark