I am ingesting data from a file source using Structured Streaming. I have a checkpoint set up and it works correctly as far as I can tell, except I don't understand what happens to the checkpoint files over time: do they grow indefinitely?
After six months of running my Structured Streaming app, I think I found part of the answer. The checkpoint files are compacted together every 10 executions, and the compacted files continue to grow. Once these compacted files got large (~2 GB), there was a noticeable increase in processing time: every tenth execution took roughly 3 to 5 minutes longer. I deleted the checkpoint files, effectively starting over, and execution time immediately returned to normal.
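The 10-execution compaction cadence is configurable. As a hedged sketch (check these settings against the Spark version you run; the property names below are the file source/sink metadata-log options as I understand them, and the values shown are illustrative, not recommendations):

```shell
# Sketch: tuning the metadata-log compaction for a Structured Streaming job.
# compactInterval controls how many batches pass between compactions
# (default 10); the deletion/cleanupDelay options govern whether and when
# old log files are removed. Adjust to your workload before relying on this.
spark-submit \
  --conf spark.sql.streaming.fileSource.log.compactInterval=10 \
  --conf spark.sql.streaming.fileSource.log.deletion=true \
  --conf spark.sql.streaming.fileSink.log.compactInterval=10 \
  --conf spark.sql.streaming.fileSink.log.deletion=true \
  --conf spark.sql.streaming.fileSink.log.cleanupDelay=600000 \
  my_streaming_app.py   # hypothetical application name
```

A larger compact interval means fewer compaction pauses but bigger compacted files, so this trades pause frequency against pause length rather than eliminating the growth.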
Regarding my second question: there are essentially two checkpoint locations. One is the checkpoint folder you specify, and the other is a _spark_metadata folder inside the output table directory. Both need to be removed to start over with a clean checkpoint.
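Resetting the checkpoint state can be sketched like this (the paths are hypothetical placeholders; substitute your own checkpoint directory and sink/table directory, and use `hdfs dfs -rm -r` or the equivalent object-store command if the paths are not on a local filesystem):

```shell
# Hypothetical locations -- replace with your actual paths.
CHECKPOINT_DIR=/tmp/demo_checkpoint
TABLE_DIR=/tmp/demo_table

# Simulate the two places Structured Streaming keeps state:
# the query's checkpoint dir and the sink's _spark_metadata log.
mkdir -p "$CHECKPOINT_DIR/offsets" "$TABLE_DIR/_spark_metadata"

# Remove BOTH to truly start over. Deleting only the checkpoint dir
# leaves the sink's _spark_metadata log behind, and the query will
# still see the old file-sink history.
rm -rf "$CHECKPOINT_DIR" "$TABLE_DIR/_spark_metadata"
```

Note that deleting `_spark_metadata` only resets the sink's log; the already-written data files in the table directory are left in place.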