Limit Kafka batch size when using Spark Streaming

独厮守ぢ 2020-12-08 08:15

Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?

I am asking because the first batch I get has hundreds of millions of records.

3 Answers
  •  南方客 (OP)
     2020-12-08 08:53

    Apart from the above answers, batch size is the product of three parameters (see the configuration sketch after this list):

    1. batchDuration: the time interval at which streaming data is divided into batches (in seconds).
    2. spark.streaming.kafka.maxRatePerPartition: sets the maximum number of messages read per partition per second. Combined with batchDuration, this controls the maximum batch size. You generally want maxRatePerPartition set explicitly and large (otherwise you are effectively throttling your job) and batchDuration small.
    3. Number of partitions in the Kafka topic.
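
    As a minimal sketch of wiring these parameters together with the Kafka direct stream: the broker address, topic name, group id, and rate values below are hypothetical, and the arithmetic assumes an 8-partition topic.

    ```scala
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    // Hypothetical numbers: batchDuration = 5 s, maxRatePerPartition = 10,000
    // records/partition/s, and a topic with 8 partitions.
    // Maximum batch size = 5 * 10,000 * 8 = 400,000 records.
    val conf = new SparkConf()
      .setAppName("KafkaBatchLimit")
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")

    val ssc = new StreamingContext(conf, Seconds(5)) // batchDuration

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",           // hypothetical broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "batch-limit-example"       // hypothetical group id
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("my-topic"), kafkaParams) // hypothetical topic
    )

    // Each RDD corresponds to one batch, capped by the product above.
    stream.foreachRDD(rdd => println(s"Batch contains ${rdd.count()} records"))

    ssc.start()
    ssc.awaitTermination()
    ```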

    For a better explanation of how this product behaves when backpressure is enabled or disabled, see: set spark.streaming.kafka.maxRatePerPartition for createDirectStream.
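
    As a rough sketch of the backpressure side (the property names are standard Spark settings; the rate values are hypothetical): with backpressure enabled, Spark tunes the per-partition ingestion rate from observed batch processing times, and maxRatePerPartition acts as a ceiling rather than a fixed rate.

    ```scala
    import org.apache.spark.SparkConf

    // With backpressure enabled, Spark adjusts the read rate dynamically;
    // maxRatePerPartition becomes an upper bound instead of a constant rate.
    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      .set("spark.streaming.backpressure.initialRate", "1000")   // rate for the first batch
      .set("spark.streaming.kafka.maxRatePerPartition", "10000") // hard upper bound
    ```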
