Spark Streaming: Kafka group id not permitted in Spark Structured Streaming

天涯浪子 提交于 2020-07-03 08:09:06

问题


I am writing a Spark structured streaming application in PySpark to read data from Kafka.

However, the current version of Spark is 2.1.0, which does not allow me to set group id as a parameter and will generate a unique id for each query. But the Kafka connection is group-based authorization which requires a pre-set group id.

Hence, is there any workaround to establish the connection without the need to update Spark to 2.2 since my team does not want it.

My Code:

if __name__ == "__main__":
    spark = SparkSession.builder.appName("DNS").getOrCreate()
    sc = spark.sparkContext
    sc.setLogLevel("WARN")

    # Subscribe to 1 topic
    lines = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "host:9092").option("subscribe", "record").option('kafka.security.protocol',"SASL_PLAINTEXT").load()
    print(lines.isStreaming) #print TRUE
    lines.selectExpr("CAST(value AS STRING)")
    # Split the lines into words
    words = lines.select(
    explode(
        split(lines.value, " ")
        ).alias("word")
    )
    # Generate running word count
    wordCounts = words.groupBy("word").count()

    # Start running the query that prints the running counts to the console
    query = wordCounts \
        .writeStream \
        .outputMode("complete") \
        .format("console") \
        .start()

    query.awaitTermination()

回答1:


KafkaUtils class will override the parameter value for "group.id". It will concat "spark-executor-" in from of the orginal group id.

Below is the code from KafkaUtils where is doing this:

// driver and executor should be in different consumer groups
    val originalGroupId = kafkaParams.get(ConsumerConfig.GROUP_ID_CONFIG)
    if (null == originalGroupId) {
      logError(s"${ConsumerConfig.GROUP_ID_CONFIG} is null, you should probably set it")
    }
    val groupId = "spark-executor-" + originalGroupId
    logWarning(s"overriding executor ${ConsumerConfig.GROUP_ID_CONFIG} to ${groupId}")
    kafkaParams.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)

We faced the same problem. Kafka was based on ACL with presets group id, so the only thing was to alter the group id in kafka configuration. Insead of our original group id we put "spark-executor-" + originalGroupId



来源:https://stackoverflow.com/questions/46312709/spark-streaming-kafka-group-id-not-permitted-in-spark-structured-streaming

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!