Why is Kafka pull-based instead of push-based? I agree that Kafka gives high throughput, as I have experienced it, but I don't see how Kafka's throughput would go down if it were push-based instead.
Scalability was the major driving factor in this design decision (pull vs. push). Kafka is very scalable: one of its key benefits is that it is very easy to add a large number of consumers without affecting performance and without downtime.
Kafka can handle events arriving from producers at rates of 100k+ per second. Because Kafka consumers pull data from the topic, different consumers can consume messages at different paces. Kafka also supports different consumption models: you can have one consumer processing messages in real time and another processing them in batch mode.
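To make the "different paces" point concrete, here is a minimal sketch of the pull model: a partition is just an append-only log, and each consumer keeps its own offset, so a one-at-a-time consumer and a batch consumer can read the same data independently. The `Log`/`Consumer` names below are illustrative, not Kafka's actual API.

```python
class Log:
    """Stands in for a partition: the broker just stores messages."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read(self, offset, max_records):
        """A pull: the consumer asks for up to max_records from its offset."""
        return self.messages[offset:offset + max_records]


class Consumer:
    """Each consumer owns its offset, so its pace is independent."""
    def __init__(self, log, max_records):
        self.log = log
        self.offset = 0
        self.max_records = max_records

    def poll(self):
        batch = self.log.read(self.offset, self.max_records)
        self.offset += len(batch)
        return batch


log = Log()
for i in range(10):
    log.append(f"event-{i}")

realtime = Consumer(log, max_records=1)   # processes one message at a time
batch = Consumer(log, max_records=5)      # processes in larger batches

first = realtime.poll()   # ['event-0']
chunk = batch.poll()      # ['event-0', ..., 'event-4'] -- same data, own pace
```

The key design point is that the broker never tracks what each consumer can handle; a slow consumer simply falls behind in its own offset rather than being overwhelmed by pushes.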
Another reason is that Kafka was not designed for only one kind of consumer, such as Hadoop. Different consumers can have diverse needs and capabilities.
Pull-based systems have some deficiencies, such as resources wasted by consumers polling regularly when there is no data. Kafka alleviates this drawback with a 'long polling' mode that waits until real data comes through.
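A minimal sketch of the long-polling idea: instead of spinning in a tight loop, the consumer's fetch blocks until data arrives or a timeout expires, so no CPU is wasted while the topic is empty. A `queue.Queue` stands in for the broker here; this is not Kafka's API, just the waiting pattern it uses.

```python
import queue
import threading
import time

broker = queue.Queue()  # stand-in for the broker's data for one consumer

def long_poll(timeout):
    """Block up to `timeout` seconds waiting for a record; None on timeout."""
    try:
        return broker.get(timeout=timeout)
    except queue.Empty:
        return None

# Nothing published yet: the poll waits quietly, then gives up -- no busy loop.
empty_result = long_poll(timeout=0.05)   # None

# A producer publishes after a short delay...
def produce_later():
    time.sleep(0.05)
    broker.put("event-1")

threading.Thread(target=produce_later).start()

# ...and the consumer's blocking poll returns as soon as the record arrives.
record = long_poll(timeout=1.0)          # "event-1"
```

In real Kafka the analogous knobs are on the fetch request (a minimum number of bytes to wait for and a maximum wait time), which let the consumer trade latency against fewer, larger fetches.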
Refer to the Kafka documentation, which details this particular design decision: Push vs. pull.
The major points in favor of pull are:
The drawback of a pull-based system (consumers polling for data when none is available) is alleviated somewhat by a 'long poll' mode that waits until data arrives.
Others have provided answers based on Kafka's documentation, but sometimes product documentation should be taken with a grain of salt rather than as an absolute technical reference. For example:
It may be interesting to note that various push-pull messaging systems were developed in the late 1990s as a way to optimize goodput. The results were never staggering, and system complexity and other factors often outweighed that kind of optimization. I believe this is Jay's overall point about practical performance over real data-center networks, not to mention things like the open Internet.