Spark Structured Streaming Checkpoint Cleanup

Backend · Unresolved · 2 answers · 1568 views

I am ingesting data from a file source using Structured Streaming. I have a checkpoint set up, and it works correctly as far as I can tell, except I don't understand what will

2 Answers
  •  时光取名叫无心
    2021-01-02 04:17

    After 6 months of running my Structured Streaming app, I think I found an answer. The checkpoint files are compacted together every 10 executions and continue to grow. Once these compacted files got large (~2 GB), there was a noticeable increase in processing time: every 10th execution ran approximately 3-5 minutes longer. I deleted the checkpoint files, effectively starting over, and execution time was instantly back to normal.
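    The compaction behavior described above is governed by a handful of Spark SQL settings. The key names below come from the Spark configuration reference (they match the "every 10 executions" cadence via the default compact interval), but the values shown are illustrative only, and defaults can vary between Spark versions:

    ```python
    # Spark SQL settings that govern the file source/sink metadata-log
    # compaction described above. Values are illustrative, not recommendations.
    compaction_settings = {
        # compact the metadata log every N micro-batches
        # (the "every 10 executions" observed in practice)
        "spark.sql.streaming.fileSource.log.compactInterval": "10",
        # whether obsolete log files are deleted after compaction
        "spark.sql.streaming.fileSource.log.deletion": "true",
        # how long to keep obsolete log files around before deleting them
        "spark.sql.streaming.fileSource.log.cleanupDelay": "10m",
        # minimum number of batches whose metadata must be retained for recovery
        "spark.sql.streaming.minBatchesToRetain": "100",
    }

    # These would typically be applied while building the session, e.g.:
    #   builder = SparkSession.builder
    #   for key, value in compaction_settings.items():
    #       builder = builder.config(key, value)
    ```

    Lowering `minBatchesToRetain` or the cleanup delay may keep the compacted log smaller, at the cost of a shorter recovery window.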

    For my second question, I found that there are essentially two checkpoint locations: the checkpoint folder specified in the query, and a _spark_metadata folder inside the output table directory. Both need to be removed to start over with a clean checkpoint.
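    The two-location cleanup above can be sketched as a small script. The function name and the directory paths in the demo are hypothetical; the only assumption taken from the answer is that both the query's checkpoint folder and the `_spark_metadata` folder under the output directory must go:

    ```python
    # Sketch: reset a Structured Streaming query by deleting BOTH checkpoint
    # locations. Run this only while the streaming query is stopped.
    import os
    import shutil
    import tempfile


    def remove_streaming_checkpoints(checkpoint_dir: str, table_dir: str) -> None:
        """Delete the query checkpoint folder and the _spark_metadata folder
        that the file sink maintains inside the output table directory."""
        for path in (checkpoint_dir, os.path.join(table_dir, "_spark_metadata")):
            if os.path.isdir(path):
                shutil.rmtree(path)


    if __name__ == "__main__":
        # Demo with throwaway directories standing in for real HDFS/S3 paths.
        root = tempfile.mkdtemp()
        ckpt = os.path.join(root, "checkpoints")
        table = os.path.join(root, "output_table")
        os.makedirs(ckpt)
        os.makedirs(os.path.join(table, "_spark_metadata"))

        remove_streaming_checkpoints(ckpt, table)
        print(os.path.exists(ckpt))  # → False
    ```

    Note that for checkpoints on HDFS or S3 you would use the corresponding filesystem tooling (`hdfs dfs -rm -r`, `aws s3 rm --recursive`) rather than local `shutil`.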
