Flink Checkpoint Failure - Checkpoints time out after 10 mins

Submitted by 不问归期 on 2021-02-19 04:25:07

Question


We get one or two checkpoint failures every day while processing data. The data volume is low (under 10k), and our checkpoint interval is set to 2 minutes. (Processing is slow because the job has to sink the data to an external API endpoint at the end of the Flink job, and that endpoint takes some time to respond, so the total time is the streaming time plus the time to sink to the external API.)

The root issue: checkpoints time out after 10 minutes. This happens because processing the data takes longer than 10 minutes, so the checkpoint times out. We could increase the parallelism to speed up processing, but if the data volume grows we would have to increase it again, so we would rather not go that way.

Suggested solution: I saw someone suggest setting a pause between the old and the new checkpoint. My question is: if I set a pause, will the new checkpoint miss the state changes that happen during the pause?

Aim: How can we avoid this issue and record the correct state without missing any data?

Failed checkpoint: (screenshot)

Completed checkpoint: (screenshot)

Subtask didn't respond: (screenshot)

Thanks


Answer 1:


There are several related configuration variables you can set -- such as the checkpoint interval, the pause between checkpoints, and the number of concurrent checkpoints. No combination of these settings will result in data being skipped for checkpointing.

Setting a minimum pause between checkpoints means that Flink won't initiate a new checkpoint until some time has passed since the completion (or failure) of the previous checkpoint -- but this has no effect on the timeout.
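To make the difference between the pause and the timeout concrete, here is a small plain-Java sketch. It is a simplified model and not Flink's actual scheduler: it assumes one checkpoint at a time, and that the next checkpoint starts at the later of (previous start + interval) and (previous end + pause). The class name, the method names, and the checkpoint durations are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointTimeline {

    /** Outcome of one simulated checkpoint; all times are in milliseconds. */
    static final class Attempt {
        final long start;
        final long end;
        final boolean timedOut;

        Attempt(long start, long end, boolean timedOut) {
            this.start = start;
            this.end = end;
            this.timedOut = timedOut;
        }
    }

    /**
     * Simplified model (an assumption, not Flink's real scheduling code):
     * one checkpoint at a time; the next one starts at the later of
     * (previous start + interval) and (previous end + minPause).
     */
    static List<Attempt> simulate(long interval, long minPause, long timeout, long[] durations) {
        List<Attempt> attempts = new ArrayList<>();
        long start = 0;
        for (long duration : durations) {
            // Whether a checkpoint times out depends only on its duration and the timeout.
            boolean timedOut = duration > timeout;
            // A timed-out checkpoint is abandoned once the timeout elapses.
            long end = start + Math.min(duration, timeout);
            attempts.add(new Attempt(start, end, timedOut));
            // The pause only shifts when the next checkpoint may begin.
            start = Math.max(start + interval, end + minPause);
        }
        return attempts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // Hypothetical durations: 1 min, 12 min (slow sink), 1 min.
        long[] durations = {1 * minute, 12 * minute, 1 * minute};
        for (long pause : new long[] {0L, 5 * minute}) {
            System.out.println("minPause = " + pause / minute + " min");
            for (Attempt a : simulate(2 * minute, pause, 10 * minute, durations)) {
                System.out.println("  start=" + a.start / minute + " min, end=" + a.end / minute
                        + " min, timedOut=" + a.timedOut);
            }
        }
    }
}
```

In this model the 12-minute checkpoint times out whether the pause is 0 or 5 minutes. The pause only moves the start times of the checkpoints that follow.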

Sounds like you should extend the timeout, which you can do like this:

env.getCheckpointConfig().setCheckpointTimeout(n);

where n is measured in milliseconds. See the section of the Flink docs on enabling and configuring checkpointing for more details.
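As a rough sketch of how these settings might fit together, assuming the Flink 1.x DataStream API: the 2-minute interval comes from the question, and 10 minutes is Flink's default checkpoint timeout. The 30-minute timeout, the 30-second pause, and the class and helper names are illustrative placeholders, not recommendations.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSettingsSketch {

    // Hypothetical helper; the numeric values are placeholders to tune per job.
    static void configureCheckpointing(StreamExecutionEnvironment env) {
        // Trigger a checkpoint every 2 minutes, as in the question.
        env.enableCheckpointing(2 * 60 * 1000L);

        CheckpointConfig config = env.getCheckpointConfig();
        // Raise the timeout above the 10-minute default (assumed value: 30 minutes).
        config.setCheckpointTimeout(30 * 60 * 1000L);
        // Wait at least 30 seconds after one checkpoint ends before starting the next.
        config.setMinPauseBetweenCheckpoints(30 * 1000L);
        // Allow only one checkpoint in flight at a time.
        config.setMaxConcurrentCheckpoints(1);
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        configureCheckpointing(env);
        // Sources, transformations, sinks, and env.execute(...) would follow here.
    }
}
```

The timeout should comfortably exceed the slowest checkpoint you expect, which here is dominated by the external API sink.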



Source: https://stackoverflow.com/questions/55857289/flink-checkpoint-failure-checkpoints-time-out-after-10-mins
