I am ingesting data from a file source using Structured Streaming. I have a checkpoint set up, and it works correctly as far as I can tell, except I don't understand what will happen in a couple of situations (see my two questions below).
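For context, a minimal sketch of the kind of query I'm running (the paths and schema are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-ingest").getOrCreate()

// Read new files as they land in the source directory (hypothetical path and schema).
val input = spark.readStream
  .format("json")
  .schema("id LONG, payload STRING")
  .load("/data/incoming")

// Write to a table directory, tracking progress in the checkpoint folder.
val query = input.writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/checkpoints/file-ingest")
  .option("path", "/data/table")
  .start()

query.awaitTermination()
```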
After six months of running my Structured Streaming app, I think I found some answers. The checkpoint files are compacted together every 10 executions and do continue to grow. Once these compacted files got large (~2 GB), there was a noticeable increase in processing time: every tenth execution was delayed by roughly 3-5 minutes. I cleaned up the checkpoint files, thereby starting over, and execution time was instantly back to normal.
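If the compaction cycle is the bottleneck, its cadence and log cleanup appear to be tunable through internal Spark SQL settings; the every-10-executions pattern matches their defaults. A hedged sketch, so verify these config names against your Spark version:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// How often the file source/sink metadata logs are compacted (default: every 10 batches).
spark.conf.set("spark.sql.streaming.fileSource.log.compactInterval", "10")
spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", "10")

// Whether expired sink log entries are deleted, and how long to wait before deleting them.
spark.conf.set("spark.sql.streaming.fileSink.log.deletion", "true")
spark.conf.set("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")
```

Note that even with deletion enabled, the compacted log files themselves keep accumulating entries, which matches the growth I observed.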
For my second question, I found that there are essentially two checkpoint locations: the checkpoint folder specified on the query, and a _spark_metadata folder inside the table (output) directory. Both need to be removed to start over with a fresh checkpoint.
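A minimal sketch of the "start over" cleanup, assuming the hypothetical paths from the snippet above; if either directory survives, the query resumes from the old state:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// 1. The checkpoint folder passed via the checkpointLocation option.
fs.delete(new Path("/data/checkpoints/file-ingest"), true)

// 2. The _spark_metadata folder the file sink keeps inside the output directory.
fs.delete(new Path("/data/table/_spark_metadata"), true)
```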
If my streaming app runs for a long time, will the checkpoint files just continue to grow forever, or are they eventually cleaned up?
Structured Streaming keeps a background maintenance thread that is responsible for deleting snapshots and deltas of your state, so you shouldn't be concerned about it unless your state is really large and the amount of space you have is small, in which case you can configure how many deltas/snapshots Spark retains.
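For reference, a sketch of the retention settings I mean, shown with their defaults; most of these are internal configs, so double-check them for your Spark version:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// How many recent batches' worth of metadata and state to keep around (default 100).
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "100")

// How many state-store delta files accumulate before being rolled into a snapshot (default 10).
spark.conf.set("spark.sql.streaming.stateStore.minDeltasForSnapshot", "10")

// How often the background maintenance task runs to delete old deltas/snapshots (default 60s).
spark.conf.set("spark.sql.streaming.stateStore.maintenanceInterval", "60s")
```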
When I manually remove or alter the checkpoint folder, or change to a different checkpoint folder, no new files are ingested.
I'm not really sure what you mean here, but you should only remove checkpointed data in special cases. Structured Streaming allows you to keep state across version upgrades as long as the stored data types are backwards compatible. I don't see a good reason for altering the location of your checkpoint or deleting the files manually unless something went wrong.