I want to use Spark Structured Streaming to read from a secure kafka. This means that I will need to force a specific group.id. However, as is stated in the documentation th
According to the Structured Kafka Integration Guide you can provide the ConsumerGroup as an option kafka.group.id:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.option("kafka.group.id", "myConsumerGroup")
.load()
However, Spark will not commit any offsets back so the offsets of your ConsumerGroups will not be stored in Kafka's internal topic __consumer_offsets but rather in Spark's checkpoint files.
Being able to set the group.id is meant to deal with Kafka's latest feature Authorization using Role-Based Access Control for which your ConsumerGroup usually needs to follow naming conventions.
A full example of a Spark 3.x application setting kafka.group.id is discussed and solved here.