pyspark structured streaming write to parquet in batches

Submitted by 拟墨画扇 on 2021-02-08 09:51:54

Question


I am applying some transformations to a Spark Structured Streaming dataframe and storing the transformed dataframe as parquet files in HDFS. I want the write to HDFS to happen in batches, instead of transforming the whole dataframe first and then storing it.


Answer 1:


Here is a parquet sink example:

# parquet sink example
# The chain is wrapped in parentheses so the leading-dot style is valid Python.
targetParquetHDFS = (
    sourceTopicKAFKA
    .writeStream
    .format("parquet")  # can be "orc", "json", "csv", etc.
    .outputMode("append")  # file sinks support only "append"
    .option("path", "path/to/destination/dir")
    .partitionBy("col")  # if you need to partition
    .trigger(processingTime="...")  # micro-batch interval at which data is output to the sink
    .option("checkpointLocation", "path/to/checkpoint/dir")  # write-ahead logs for recovery purposes
    .start()
)
targetParquetHDFS.awaitTermination()
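
The snippet above assumes that sourceTopicKAFKA is already a streaming DataFrame. As a minimal sketch of where it could come from, the function below reads a Kafka topic and starts the same kind of parquet sink. The function name, its parameters, and the CAST projection are illustrative assumptions, not part of the original answer, and running it against a real cluster also requires the Spark Kafka integration package.

```python
def start_kafka_to_parquet(spark, bootstrap_servers, topic,
                           output_dir, checkpoint_dir, interval="10 seconds"):
    """Read a Kafka topic as a stream and append it to parquet files.

    `spark` is an existing SparkSession. All other arguments are
    placeholders the caller supplies (hypothetical names).
    """
    source = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )
    # Kafka delivers key and value as binary, so cast them before storing.
    transformed = source.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
    return (
        transformed.writeStream
        .format("parquet")
        .outputMode("append")
        .option("path", output_dir)
        .option("checkpointLocation", checkpoint_dir)
        .trigger(processingTime=interval)  # each micro-batch is written as new files
        .start()
    )
```

The return value is the started streaming query, so the caller can invoke awaitTermination() on it exactly as in the example above.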

For more specific details:

Kafka Integration: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

Structured Streaming (SS) Programming Guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks

Added:

OK, I added some details to the answer to address your question.

SS has a few different trigger types:

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers

default: the next micro-batch starts as soon as the previous one has finished processing

fixed interval: .trigger(processingTime='10 seconds'), so with a 10-second interval the trigger fires at 00:10, 00:20, 00:30, and so on

one-time: .trigger(once=True) processes all available data in a single batch and then stops

continuous with a fixed checkpoint interval: best to see the programming guide
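
The trigger types listed above map onto keyword arguments of the writer's trigger() method. The helper below is a small sketch of that mapping; the name with_trigger and the kind labels are made up for illustration, while processingTime, once, and continuous are the keyword arguments PySpark accepts.

```python
def with_trigger(writer, kind="default", interval=None):
    """Return `writer` (a DataStreamWriter) configured with the chosen trigger.

    `kind` is a hypothetical label used only in this sketch.
    """
    if kind == "default":
        # No .trigger() call: the next micro-batch starts when the previous one ends.
        return writer
    if kind == "fixed":
        return writer.trigger(processingTime=interval)  # e.g. "10 seconds"
    if kind == "once":
        return writer.trigger(once=True)  # process the available data, then stop
    if kind == "continuous":
        return writer.trigger(continuous=interval)  # checkpoint interval, e.g. "1 second"
    raise ValueError(f"unknown trigger kind: {kind!r}")
```

For example, with_trigger(df.writeStream.format("parquet"), "fixed", "10 seconds") would configure a 10-second micro-batch interval. Continuous processing supports only a limited set of sources and sinks, so check the programming guide before combining it with a file sink.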

Therefore, in your Kafka example, SS can process the data, based on its event-time timestamp, in micro-batches using the "default" or "fixed interval" trigger, or it can process all the data currently available in the Kafka source topic in a single run using the "one-time" trigger.
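
To make the difference between the "default" and "fixed interval" triggers concrete, here is a toy model in plain Python. It is not Spark's scheduler, only a simplified sketch of the behaviour described in the programming guide: with a fixed interval, a batch that finishes early waits for the next interval boundary, and a batch that overruns is followed immediately by the next one. The function names and the sample durations are made up for illustration.

```python
def next_boundary(now_s, interval_s):
    """Next multiple of interval_s strictly after now_s (all values in seconds)."""
    return (now_s // interval_s) * interval_s + interval_s


def simulate_batch_starts(durations_s, interval_s, start_s=0):
    """Return the start time of each micro-batch.

    interval_s == 0 models the default trigger (batches run back to back).
    """
    starts = []
    now = start_s
    for duration in durations_s:
        starts.append(now)
        finished = now + duration
        if interval_s > 0:
            # Wait for the boundary unless the batch has already overrun it.
            now = max(finished, next_boundary(now, interval_s))
        else:
            now = finished
    return starts


durations = [3, 4, 12, 2]  # hypothetical processing times in seconds
print(simulate_batch_starts(durations, interval_s=10))  # [0, 10, 20, 32]
print(simulate_batch_starts(durations, interval_s=0))   # [0, 3, 7, 19]
```

In the fixed-interval run the third batch takes 12 seconds, misses the 00:30 boundary, and the fourth batch therefore starts at 00:32 as soon as it finishes.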



Source: https://stackoverflow.com/questions/55859868/pyspark-structured-streaming-write-to-parquet-in-batches
