I'm building a Kafka ingest module on EMR 5.11.1 with Spark 2.2.1. My intention is to use Structured Streaming to consume from a Kafka topic, do some processing, and store to E
It turns out that S3 does not provide the consistent read-after-write semantics that Spark's checkpointing relies on.
This article suggests using AWS EFS for checkpointing.
S3 remains a good place to ingest data from, or egest data to.
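To make the split concrete, here is a minimal sketch of a Structured Streaming job where the checkpoint lives on a consistent filesystem (an EFS mount, as the article suggests, or HDFS) while the data itself still lands in S3. All names here are placeholders, not from the original post: the broker address, topic, bucket, and the `/mnt/efs` mount point are assumptions, and the job needs the `spark-sql-kafka-0-10` package on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object KafkaIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-ingest")
      .getOrCreate()

    // Consume from Kafka (requires the spark-sql-kafka-0-10 package).
    // Broker and topic names below are placeholders.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "events")                    // placeholder
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Data sink can stay on S3; only the checkpoint needs a filesystem
    // with read-after-write semantics. The EFS mount path is a placeholder.
    val query = stream.writeStream
      .format("parquet")
      .option("path", "s3://my-bucket/ingest/")                              // placeholder
      .option("checkpointLocation", "file:///mnt/efs/checkpoints/kafka-ingest") // placeholder
      .start()

    query.awaitTermination()
  }
}
```

Note the `file://` scheme on the checkpoint location: on EMR the default filesystem is HDFS, so a bare local path would otherwise be resolved against HDFS rather than the EFS mount.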
I solved this by clearing my checkpoint path:

1. Remove your checkpoint path (`-rmr` is deprecated; use `-rm -r`):

   sudo -u hdfs hdfs dfs -rm -r ${your_checkpoint_path}

2. Resubmit your Spark job.