I am ingesting data from a file source using Structured Streaming. I have a checkpoint set up and it works correctly as far as I can tell, except I don't understand what happens to the checkpoint files over time: do they grow indefinitely?
After six months of running my Structured Streaming app, I think I found part of the answer. The checkpoint files are compacted together every 10 executions, and the compacted files continue to grow. Once these compacted files got large (~2 GB), there was a noticeable increase in processing time: every tenth execution took roughly 3 to 5 minutes longer. I deleted the checkpoint files, effectively starting over, and execution time immediately returned to normal.
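The 10-execution compaction cadence is configurable. As a hedged sketch (check these settings against the Spark version you run; the property names below are the file source/sink metadata-log options as I understand them, and the values shown are illustrative, not recommendations):

```shell
# Sketch: tuning the metadata-log compaction for a Structured Streaming job.
# compactInterval controls how many batches pass between compactions
# (default 10); the deletion/cleanupDelay options govern whether and when
# old log files are removed. Adjust to your workload before relying on this.
spark-submit \
  --conf spark.sql.streaming.fileSource.log.compactInterval=10 \
  --conf spark.sql.streaming.fileSource.log.deletion=true \
  --conf spark.sql.streaming.fileSink.log.compactInterval=10 \
  --conf spark.sql.streaming.fileSink.log.deletion=true \
  --conf spark.sql.streaming.fileSink.log.cleanupDelay=600000 \
  my_streaming_app.py   # hypothetical application name
```

A larger compact interval means fewer compaction pauses but bigger compacted files, so this trades pause frequency against pause length rather than eliminating the growth.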
Regarding my second question: there are essentially two checkpoint locations. One is the checkpoint folder you specify, and the other is a _spark_metadata folder inside the output table directory. Both need to be removed to start over with a clean checkpoint.
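Resetting the checkpoint state can be sketched like this (the paths are hypothetical placeholders; substitute your own checkpoint directory and sink/table directory, and use `hdfs dfs -rm -r` or the equivalent object-store command if the paths are not on a local filesystem):

```shell
# Hypothetical locations -- replace with your actual paths.
CHECKPOINT_DIR=/tmp/demo_checkpoint
TABLE_DIR=/tmp/demo_table

# Simulate the two places Structured Streaming keeps state:
# the query's checkpoint dir and the sink's _spark_metadata log.
mkdir -p "$CHECKPOINT_DIR/offsets" "$TABLE_DIR/_spark_metadata"

# Remove BOTH to truly start over. Deleting only the checkpoint dir
# leaves the sink's _spark_metadata log behind, and the query will
# still see the old file-sink history.
rm -rf "$CHECKPOINT_DIR" "$TABLE_DIR/_spark_metadata"
```

Note that deleting `_spark_metadata` only resets the sink's log; the already-written data files in the table directory are left in place.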