How to set up a starting point for the batchId of foreachBatch?

Question

The problem I am facing is that my process relies on the batchId of foreachBatch to control what is ready for the second stage of the pipeline. A batch will only go to the second stage once the first stage has completed for it.

I want to guarantee that if something goes wrong, the stream can continue from where it stopped.

We tried to add some control by recording all completed batches in a Delta table; however, I couldn't find a way to set the initial batchId.

Answer 1:

Going by the information you have provided, one option is some form of custom checkpointing. For each batch, store the offset range together with the batch id and a state column, and update the state from RUNNING to COMPLETED as the batch progresses.

If something goes wrong, check the state of the last batch: if it is not COMPLETED, restart from that batch's starting offset; otherwise, start from the next offset.
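The bookkeeping described above can be sketched in plain Python. This is only an illustration of the idea, not a Spark API: `BatchLedger`, `BatchRecord`, and `resume_point` are hypothetical names, offsets are simplified to single integers, and the in-memory dictionary stands in for a durable control table such as the Delta table mentioned in the question.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

RUNNING = "RUNNING"
COMPLETED = "COMPLETED"


@dataclass
class BatchRecord:
    batch_id: int
    start_offset: int  # first offset in the batch (inclusive)
    end_offset: int    # offset just past the batch (exclusive)
    state: str


class BatchLedger:
    """Hypothetical in-memory stand-in for a durable control table."""

    def __init__(self) -> None:
        self._records: Dict[int, BatchRecord] = {}

    def begin(self, batch_id: int, start_offset: int, end_offset: int) -> None:
        # Record the batch as RUNNING before any work is done on it.
        self._records[batch_id] = BatchRecord(
            batch_id, start_offset, end_offset, RUNNING
        )

    def complete(self, batch_id: int) -> None:
        # Flip the state only after the batch has been fully processed.
        self._records[batch_id].state = COMPLETED

    def last(self) -> Optional[BatchRecord]:
        if not self._records:
            return None
        return self._records[max(self._records)]

    def resume_point(self) -> Tuple[int, int]:
        """Return the (batch_id, offset) to restart from."""
        last = self.last()
        if last is None:
            return 0, 0  # nothing recorded yet: start from the beginning
        if last.state == COMPLETED:
            return last.batch_id + 1, last.end_offset  # move on to new data
        return last.batch_id, last.start_offset  # redo the unfinished batch
```

For example, after `begin(0, 0, 100)`, `complete(0)`, and `begin(1, 100, 250)`, a restart would redo batch 1 from offset 100, because that batch never reached COMPLETED.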

Answer 2:

I want to guarantee that if something goes wrong, the stream can continue from where it stopped.

That is what the checkpointLocation option is for. Set it on the DataStreamWriter that uses foreachBatch, and the query uses that location as a write-ahead log (WAL) to recover from failures.

Quoting the official documentation:

Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs.

And then it says in Recovering from Failures with Checkpointing:

In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the quick example) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query.

I think that covers your use case exactly.
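A minimal PySpark sketch of that setup follows, under a few assumptions: `spark` is an existing SparkSession, the built-in `rate` source stands in for the real input, and `first_stage` and `completed_batch_ids` are hypothetical placeholders for the first-stage logic and for the control table of completed batches from the question.

```python
def start_checkpointed_query(spark, checkpoint_path, first_stage, completed_batch_ids):
    """Start a foreachBatch query whose progress is saved under checkpoint_path.

    spark               -- an existing SparkSession (assumed to be created elsewhere)
    checkpoint_path     -- path on an HDFS-compatible file system
    first_stage         -- hypothetical callable(batch_df, batch_id) doing the real work
    completed_batch_ids -- hypothetical set-like record of finished batches; in a
                           real pipeline this would be durable (e.g. a Delta table)
    """

    def process_batch(batch_df, batch_id):
        # A batch can be replayed after a failure, so skip work already done.
        if batch_id in completed_batch_ids:
            return
        first_stage(batch_df, batch_id)
        # Mark the batch as ready for the second stage only after it succeeds.
        completed_batch_ids.add(batch_id)

    # The "rate" source is a placeholder; swap in the actual streaming source.
    stream_df = spark.readStream.format("rate").load()

    return (
        stream_df.writeStream
        .foreachBatch(process_batch)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

As far as I know there is no option for choosing the initial batchId directly. When the query is restarted with the same checkpoint location, the batch numbering continues from the progress recorded there, which is why the function above treats batchId as a key for skipping batches that were already completed.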