Spark Structured Streaming File Source Starting Offset

Submitted by ╄→гoц情女王★ on 2019-12-07 16:15:24

Question


Is there a way to specify a starting offset for the Spark Structured Streaming file source?

I am trying to stream Parquet files from HDFS:

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

spark.sql("SET spark.sql.streaming.schemaInference=true")

spark.readStream
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()

As far as I can see, the first run processes all files found in the path, saves their offsets to the checkpoint location, and afterwards processes only new files, i.e. files within the accepted age that are not already in the seen-files map.

I'm looking for a way to specify a starting offset, a timestamp, or a number of files, so that the first run does not process everything that is already available.

Is there such an option?


Answer 1:


The FileStreamSource has no option to specify a starting offset.

However, you can set the latestFirst option to true so that the newest files are processed first (this option is false by default):

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources

import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

spark.readStream
  .option("latestFirst", true)
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()



Answer 2:


Thanks @jayfah. As far as I can tell, we can simulate Kafka's 'latest' starting offset with the following trick:

  1. Run a warm-up stream with option("latestFirst", true) and option("maxFilesPerTrigger", "1"), with a checkpoint, a dummy sink, and a huge trigger interval. This way the warm-up stream saves the latest file timestamp to the checkpoint.

  2. Run the real stream with option("maxFileAge", "0") and the real sink, using the same checkpoint location. The stream will then process only newly available files.
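A rough Scala sketch of the two steps above (the noop sink, the progress-polling loop, and all paths and trigger intervals are illustrative assumptions, not part of the original answer):

```scala
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

// Step 1: warm-up stream. latestFirst plus maxFilesPerTrigger=1 makes the
// first micro-batch pick only the newest file, so the checkpoint records
// that file's timestamp. The huge trigger interval keeps further batches
// from firing before we stop the query.
val warmUp = spark.readStream
  .option("latestFirst", true)
  .option("maxFilesPerTrigger", "1")
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("noop")                            // dummy sink (Spark 3.x); any throwaway sink works
  .trigger(Trigger.ProcessingTime(365.days))
  .start()

// Wait until the first micro-batch has committed, then stop the warm-up.
while (warmUp.lastProgress == null) Thread.sleep(1000)
warmUp.stop()

// Step 2: the real stream reuses the same checkpoint; maxFileAge=0 makes
// it skip everything older than the timestamp saved by the warm-up.
spark.readStream
  .option("maxFileAge", "0")
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()
```

This assumes a running SparkSession named spark; on Spark 2.x, replace the noop sink with any other throwaway sink (e.g. a memory sink).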

Most probably this is not needed in production and there is a better way, e.g. reorganizing the data paths, but this is at least the answer I found to my question.



Source: https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset
