Understanding Flink savepoints & checkpoints

Submitted on 2020-01-06 08:05:43

Question


Considering an Apache Flink streaming-application with a pipeline like this:

Kafka-Source -> flatMap 1 -> flatMap 2 -> flatMap 3 -> Kafka-Sink

where every flatMap function is a non-stateful operator (e.g. the normal .flatMap function of a DataStream).

How do checkpoints/savepoints work when an incoming message is pending at flatMap 3? Will the message be reprocessed after a restart, beginning from flatMap 1, or will it skip ahead to flatMap 3?

I am a bit confused, because the documentation seems to define application state as the state I can use in stateful operators, but I don't have any stateful operators in my application. Is the "processing progress" saved and restored at all, or will the whole pipeline be re-processed after a failure/restart?

And is there a difference between a failure (where Flink restores from a checkpoint) and a manual restart using a savepoint, regarding my previous questions?

I tried finding out myself (with checkpointing enabled, using EXACTLY_ONCE mode and the RocksDB backend) by placing a Thread.sleep() in flatMap 3 and then cancelling the job with a savepoint. However, this led to the Flink command-line tool hanging until the sleep was over, and even then flatMap 3 was executed and its output was even sent to the sink before the job got cancelled. So it seems I can not manually force this situation to analyze Flink's behaviour.
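For reference, a setup like the one described above can be sketched in PyFlink. This is only an illustrative configuration fragment, not the asker's actual code; the checkpoint interval is a placeholder:

```python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint every 10 s with exactly-once mode (interval is a placeholder).
env.enable_checkpointing(10_000, CheckpointingMode.EXACTLY_ONCE)

# Use the RocksDB state backend, as in the experiment described above.
env.set_state_backend(EmbeddedRocksDBStateBackend())
```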

In case "processing progress" is not saved/covered by checkpoints/savepoints as I described above, how could I make sure, for every message reaching my pipeline, that no given operator (flatMap 1/2/3) is ever re-processed in a restart/failure situation?


Answer 1:


When a checkpoint is taken, every task (parallel instance of an operator) checkpoints its state. In your example, the three flatmap operators are stateless, so there is no state to be checkpointed. The Kafka source is stateful and checkpoints the reading offsets for all partitions.

In case of a failure, the job is recovered and all tasks load their state, which means, in the case of the source operator, that the reading offsets are reset. Hence, the application will reprocess all events since the last checkpoint.
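To make this concrete, here is a minimal simulation in plain Python (not Flink code; all function names are made up for illustration). The only state a checkpoint records for this pipeline is the source's reading offset, so after a failure every event since the last checkpoint flows through all three flatMaps again, regardless of how far an in-flight event had already progressed:

```python
# Toy model of checkpoint/restore for a stateless pipeline.
# The only checkpointed state is the source offset (as with Flink's Kafka source).

def flat_map_1(x): return x * 2        # stand-ins for the three
def flat_map_2(x): return x + 1        # stateless flatMap operators
def flat_map_3(x): return x * 10

def run(events, start_offset, crash_at=None):
    """Process events from start_offset; optionally 'crash' mid-stream."""
    processed = []
    for offset in range(start_offset, len(events)):
        if offset == crash_at:
            # Failure while this event is in flight: nothing about its
            # partial progress (e.g. "already past flatMap 2") survives.
            return processed
        processed.append(flat_map_3(flat_map_2(flat_map_1(events[offset]))))
    return processed

events = [1, 2, 3, 4]
checkpointed_offset = 2          # last completed checkpoint saved offset 2

# Crash while the event at offset 3 is in flight:
before_crash = run(events, start_offset=0, crash_at=3)

# Recovery: the source rewinds to the checkpointed offset; events at
# offsets 2 and 3 re-run through flatMap 1 -> 2 -> 3 from the beginning.
after_recovery = run(events, start_offset=checkpointed_offset)
print(after_recovery)            # [70, 90] -- both events fully reprocessed
```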

In order to achieve end-to-end exactly-once, you need a special sink connector that offers either transaction support (e.g., for Kafka) or supports idempotent writes.
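The idempotent-write option can be sketched like this (again plain Python with hypothetical names, not a real connector): writes are keyed on a stable record id, so events replayed after recovery overwrite rather than duplicate:

```python
class IdempotentSink:
    """Toy idempotent sink: writes are keyed by record id, so replays are harmless."""
    def __init__(self):
        self.store = {}

    def write(self, record_id, value):
        self.store[record_id] = value   # upsert: a replayed write just overwrites

sink = IdempotentSink()
for record_id, value in [(0, "a"), (1, "b")]:
    sink.write(record_id, value)

# After a failure, the source rewinds and record 1 is emitted a second time:
sink.write(1, "b")

print(sorted(sink.store.items()))      # [(0, 'a'), (1, 'b')] -- no duplicate
```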



Source: https://stackoverflow.com/questions/54986886/understanding-flink-savepoints-checkpoints
