Spark structured streaming app reading from multiple Kafka topics


Question


I have a Spark Structured Streaming app (v2.3.2) that needs to read from a number of Kafka topics, do some relatively simple processing (mainly aggregations and a few joins), and publish the results to a number of other Kafka topics. So multiple streams are processed in the same app.

I was wondering whether it makes a difference from a resource point of view (memory, executors, threads, Kafka listeners, etc.) if I set up just one direct readStream that subscribes to multiple topics and then split the stream with selects, versus one readStream per topic.

Something like

df = spark.readStream.format("kafka").option("subscribe", "t1,t2,t3")
...
t1df = df.select(...).where("topic = 't1'")...
t2df = df.select(...).where("topic = 't2'")...

vs.

t1df = spark.readStream.format("kafka").option("subscribe", "t1").load()
t2df = spark.readStream.format("kafka").option("subscribe", "t2").load()

Is either one more "efficient" than the other? I could not find any documentation on whether this makes a difference.

Thanks!


Answer 1:


Each action requires a full lineage execution. You're better off separating this into three separate Kafka reads. Otherwise you'll read each topic N times, where N is the number of writes.
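A minimal Scala sketch of that separate-reads layout; the broker address, output topic out1, and checkpoint path below are placeholders, not from the question:

import org.apache.spark.sql.functions.col

// one independent source per topic, so t1's data is fetched exactly once
val t1df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "t1")
  .load()

// each topic gets its own query with its own checkpoint
t1df.select(col("key"), col("value"))
  .writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out1")
  .option("checkpointLocation", "/tmp/ckpt-t1")
  .start()

// repeat the same pattern for t2 and t3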

I'd really recommend against this, but if you want to put all the topics into the same read, then do this:

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()  // cache so the two writes below reuse the same micro-batch instead of recomputing it
  batchDF.filter($"topic" === "t1").write.format(...).save(...)  // location 1
  batchDF.filter($"topic" === "t2").write.format(...).save(...)  // location 2
  batchDF.unpersist()
}.start()
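Two caveats: foreachBatch was only added in Spark 2.4, so the question's 2.3.2 would need an upgrade to use it; and since the question publishes to Kafka rather than a file sink, each filtered write can use Spark's batch Kafka writer instead of the generic save(...). A sketch of one such write, with the broker address and output topic out1 as placeholders:

// inside the foreachBatch body, in place of save(...) above
batchDF.filter($"topic" === "t1")
  .select($"key", $"value")
  .write.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out1")
  .save()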



Answer 2:


From a resource (memory and cores) point of view, there will be a difference if you run it as multiple streams (multiple driver-executor sets) on the cluster.

For the first case you mentioned:

df = spark.readStream.format("kafka").option("subscribe", "t1,t2,t3")
...
t1df = df.select(...).where("topic = 't1'")...
t2df = df.select(...).where("topic = 't2'")...

this runs as a single application: one driver and the two executors you have allocated to it.

In the second case:

t1df = spark.readStream.format("kafka").option("subscribe", "t1").load()
t2df = spark.readStream.format("kafka").option("subscribe", "t2").load()

you can run these as different streams: 2 drivers and 2 executors (1 executor each). The second case needs more memory and cores because of the extra driver.
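For contrast, both topics can also run as two streaming queries inside a single application, sharing one driver. A minimal Scala sketch continuing from t1df and t2df above; the broker address, output topics, and checkpoint paths are placeholders:

import org.apache.spark.sql.functions.col

// both queries run inside one application: one driver, shared executors
val q1 = t1df.select(col("key"), col("value")).writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out1")
  .option("checkpointLocation", "/tmp/ckpt-1")
  .start()

val q2 = t2df.select(col("key"), col("value")).writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "out2")
  .option("checkpointLocation", "/tmp/ckpt-2")
  .start()

spark.streams.awaitAnyTermination()  // the single driver supervises both queries

Submitting them as two separate applications instead is what produces the 2-driver, higher-overhead footprint described above.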



Source: https://stackoverflow.com/questions/55929540/spark-structured-streaming-app-reading-from-multiple-kafka-topics
