How to set up a starting point for the batchId of foreachBatch?

Posted by 我们两清 on 2019-12-11 14:16:59

Question


The problem I am facing is that my process relies on the batchId of foreachBatch as a kind of control over what is ready for the second stage of the pipeline, so a batch only goes to the second stage once its first stage has completed.

I want to guarantee that if something goes wrong, the stream can continue from where it stopped.

We tried to add some control by recording all completed batches in a Delta table; however, I couldn't find a way to set the initial batchId.


Answer 1:


Going by the information you have provided, you could use some form of custom checkpointing. For each batch, store the offset range together with the batch ID and a state column, and keep updating the state (RUNNING, then COMPLETED).

If something goes wrong, check the state of the last batch: if it is not COMPLETED, restart from that batch's starting offset; otherwise continue from the offset right after it.
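The bookkeeping described above can be sketched in plain Python. This is only an illustration of the idea, not code from the answer: `BatchRecord`, `BatchLedger`, and `resume_point` are hypothetical names, the ledger lives in memory instead of a Delta table, and offsets are simplified to single integers.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class BatchRecord:
    batch_id: int
    start_offset: int  # first offset covered by the batch (inclusive)
    end_offset: int    # end of the batch's offset range (exclusive)
    state: str         # "RUNNING" or "COMPLETED"


class BatchLedger:
    """Hypothetical in-memory stand-in for the control table."""

    def __init__(self) -> None:
        self._records: Dict[int, BatchRecord] = {}

    def start(self, batch_id: int, start_offset: int, end_offset: int) -> None:
        # Record the batch as RUNNING before any work is done on it.
        self._records[batch_id] = BatchRecord(
            batch_id, start_offset, end_offset, "RUNNING"
        )

    def complete(self, batch_id: int) -> None:
        # Flip the state only after the batch has been fully processed.
        self._records[batch_id].state = "COMPLETED"

    def resume_point(self) -> Tuple[int, int]:
        """Return the (batch_id, offset) from which processing should restart."""
        if not self._records:
            return (0, 0)
        last = self._records[max(self._records)]
        if last.state == "COMPLETED":
            # Last batch finished: continue with the next batch and offset.
            return (last.batch_id + 1, last.end_offset)
        # Last batch was interrupted: redo it from its starting offset.
        return (last.batch_id, last.start_offset)


ledger = BatchLedger()
ledger.start(0, 0, 100)
ledger.complete(0)
ledger.start(1, 100, 250)     # simulated crash before complete(1)
print(ledger.resume_point())  # (1, 100)
```

In a real pipeline the ledger would have to be stored durably (for example in the Delta table mentioned in the question) and updated atomically with the batch's output, which this sketch does not attempt.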




Answer 2:


I want to guarantee that if something goes wrong, the stream can continue from where it stopped.

That is what the checkpointLocation option of the streaming query (including one that writes through the foreachBatch sink) is for: the checkpoint directory serves as a write-ahead log (WAL) from which the query recovers after a failure.

Quoting the official documentation:

Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs.

And then it says in Recovering from Failures with Checkpointing:

In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the quick example) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query.

I think that covers your use case exactly.
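As a rough sketch of how this could look in PySpark: the writer calls used here (`writeStream`, `foreachBatch`, `option("checkpointLocation", ...)`, `start`) are the standard DataStreamWriter API, while `make_idempotent_handler`, `start_query`, the `completed` set, and the `second_stage` callback are hypothetical names introduced for illustration. The sketch assumes that, after a restart, a batch that was not committed may be delivered again with the same batchId, so the handler skips batch IDs it has already finished.

```python
from typing import Any, Callable, Set


def make_idempotent_handler(
    completed: Set[int],
    second_stage: Callable[[Any, int], None],
) -> Callable[[Any, int], None]:
    """Build a foreachBatch function that runs each batch ID at most once.

    `completed` is a hypothetical stand-in for a durable record of finished
    batch IDs (e.g. the Delta table from the question).
    """

    def handle(batch_df: Any, batch_id: int) -> None:
        if batch_id in completed:
            # Batch was replayed after a restart; its work is already done.
            return
        second_stage(batch_df, batch_id)
        completed.add(batch_id)

    return handle


def start_query(stream_df: Any, checkpoint_dir: str, handler: Callable[[Any, int], None]):
    """Start a streaming query whose progress is saved under checkpoint_dir.

    `stream_df` is assumed to be a streaming DataFrame; restarting with the
    same checkpoint_dir lets the query continue where it left off.
    """
    return (
        stream_df.writeStream
        .foreachBatch(handler)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

With this arrangement the batchId does not need to be set by hand: as long as the same checkpoint directory is reused, the restarted query picks up its own batch numbering from the checkpoint.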


I couldn't find a way to set the initial batchId.

That would require pointing the checkpointLocation option of the streaming query at a pre-populated directory that already contains the expected batch ID.

You could simply create the necessary files yourself and let the resumed streaming query start from that directory.

(I have never tried this myself, but it looks doable.)
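Before creating any files by hand, it may help to look at what an existing checkpoint directory already records. The read-only sketch below assumes the checkpoint lives on a local filesystem and follows the layout commonly seen in Structured Streaming checkpoints, with `offsets` and `commits` subdirectories that hold one file per batch ID. That layout is an internal detail that may differ between Spark versions, and `latest_batch_id` is a hypothetical helper, not a Spark API.

```python
import os
from typing import Optional


def latest_batch_id(checkpoint_dir: str, log_name: str) -> Optional[int]:
    """Return the highest batch ID found in <checkpoint_dir>/<log_name>.

    Assumes each batch is stored as a file named after its numeric batch ID;
    other entries (such as hidden checksum files) are ignored.
    """
    log_dir = os.path.join(checkpoint_dir, log_name)
    if not os.path.isdir(log_dir):
        return None
    batch_ids = [int(name) for name in os.listdir(log_dir) if name.isdigit()]
    return max(batch_ids) if batch_ids else None
```

For example, comparing `latest_batch_id(path, "offsets")` with `latest_batch_id(path, "commits")` would show whether the most recently planned batch was also recorded as committed, under the layout assumption stated above.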



Source: https://stackoverflow.com/questions/58955738/how-to-setup-a-starting-point-for-the-batchid-of-foreachbatch
