Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming?
I am asking because the first batch I get has hundreds of millions of records.
Apart from the above answers, the batch size is the product of three parameters:

- batchDuration: the time interval at which streaming data is divided into batches (in seconds).
- spark.streaming.kafka.maxRatePerPartition: the maximum number of messages per partition per second.
- The number of partitions in the Kafka topic.

Combined, these control the batch size. You generally want maxRatePerPartition to be set, and large (otherwise you are effectively throttling your job), and batchDuration to be very small.

For a better explanation of how this product behaves when backpressure is enabled or disabled, see "set spark.streaming.kafka.maxRatePerPartition for createDirectStream".
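As a rough sanity check, the product described above can be computed directly. The numbers below are purely hypothetical example values, not recommendations:

```python
# Estimate the maximum number of records per micro-batch for a
# Spark Streaming + Kafka direct stream.
# All three values below are hypothetical examples.

batch_duration_s = 5            # batchDuration: micro-batch interval (seconds)
max_rate_per_partition = 1000   # spark.streaming.kafka.maxRatePerPartition
                                # (messages per partition per second)
num_partitions = 10             # partitions in the Kafka topic

# Upper bound on records fetched in one batch:
max_records_per_batch = batch_duration_s * max_rate_per_partition * num_partitions
print(max_records_per_batch)  # 50000
```

So with these example settings, no batch should ever exceed 50,000 records, regardless of how much data is backlogged in Kafka.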