I am ingesting data from a file source using Structured Streaming. I have a checkpoint set up, and it works correctly as far as I can tell, except I don't understand what will happen in a couple of situations (see my two questions below).
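For context, a minimal sketch of the kind of query I'm running (the paths and schema are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-ingest").getOrCreate()

// Read new files as they land in the source directory (hypothetical path and schema).
val input = spark.readStream
  .format("json")
  .schema("id LONG, payload STRING")
  .load("/data/incoming")

// Write to a table directory, tracking progress in the checkpoint folder.
val query = input.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/checkpoints/file-ingest")
  .option("path", "/data/table")
  .start()

query.awaitTermination()
```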
After six months of running my Structured Streaming app, I think I found some answers. The checkpoint files are compacted together every 10 executions and do continue to grow. Once these compacted files got large (~2 GB), there was a noticeable increase in processing time: every tenth execution was delayed by roughly 3-5 minutes. I cleaned up the checkpoint files, thereby starting over, and execution time was instantly back to normal.
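If the compaction cycle is the bottleneck, its cadence and log cleanup appear to be tunable through internal Spark SQL settings; the every-10-executions pattern matches their defaults. A hedged sketch, so verify these config names against your Spark version:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// How often the file source/sink metadata logs are compacted (default: every 10 batches).
spark.conf.set("spark.sql.streaming.fileSource.log.compactInterval", "10")
spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", "10")

// Whether expired sink log entries are deleted, and how long to wait before deleting them.
spark.conf.set("spark.sql.streaming.fileSink.log.deletion", "true")
spark.conf.set("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")
```

Note that even with deletion enabled, the compacted log files themselves keep accumulating entries, which matches the growth I observed.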
For my second question, I found that there are essentially two checkpoint locations: the checkpoint folder specified on the query, and a _spark_metadata folder inside the table (output) directory. Both need to be removed to start over with a fresh checkpoint.
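A minimal sketch of the "start over" cleanup, assuming the hypothetical paths from the snippet above; if either directory survives, the query resumes from the old state:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// 1. The checkpoint folder passed via the checkpointLocation option.
fs.delete(new Path("/data/checkpoints/file-ingest"), true)

// 2. The _spark_metadata folder the file sink keeps inside the output directory.
fs.delete(new Path("/data/table/_spark_metadata"), true)
```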
If my streaming app runs for a long time, will the checkpoint files just continue to grow forever, or are they eventually cleaned up?
Structured Streaming keeps a background maintenance thread that is responsible for deleting snapshots and deltas of your state, so you shouldn't be concerned about it unless your state is really large and the amount of space you have is small, in which case you can configure how many deltas/snapshots Spark retains.
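For reference, a sketch of the retention settings I mean, shown with their defaults; most of these are internal configs, so double-check them for your Spark version:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// How many recent batches' worth of metadata and state to keep around (default 100).
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "100")

// How many state-store delta files accumulate before being rolled into a snapshot (default 10).
spark.conf.set("spark.sql.streaming.stateStore.minDeltasForSnapshot", "10")

// How often the background maintenance task runs to delete old deltas/snapshots (default 60s).
spark.conf.set("spark.sql.streaming.stateStore.maintenanceInterval", "60s")
```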
When I manually remove or alter the checkpoint folder, or change to a different checkpoint folder, no new files are ingested.
I'm not really sure what you mean here, but you should only remove checkpointed data in special cases. Structured Streaming allows you to keep state across version upgrades as long as the stored data types are backwards compatible. I don't see a good reason for altering the location of your checkpoint or deleting the files manually unless something went wrong.