We are getting random NetworkExceptions
and TimeoutExceptions
in our production environment:
Brok
It is a bit tricky to find the root cause, I'll drop my experience on that, hopefully someone may find it useful. In general it can be a network issue or too much network flood in combination with ack=ALL
. Here a diagram that explain the TimeoutException
from Kafka KIP-91 at he time of writing (still applicable till 1.1.0):
Excluding network configuration issues or errors, this are the properties you can adjust depending on your scenario in order to mitigate or solve the problem:
The buffer.memory controls the total memory available to a producer for buffering. If records get sent faster than they can be transmitted to Kafka then and this buffer will get exceeded then additional send calls block up to max.block.ms after then Producer throws a TimeoutException
.
The max.block.ms has already a high value and I do not suggest to further increment it. buffer.memory has the default value of 32MB and depending on you message size you may want to increase it; if necessary increase the jvm heap space.
Retries define how many attempts to resend the record in case of error before giving up. If you are using zero retries you can try to mitigate the problem by increasing this value, beware record order is not guarantee anymore unless you set max.in.flight.requests.per.connection to 1.
Records are sent as soon as the batch size is reached or the linger time is passed, whichever comes first. if batch.size (default 16kb) is smaller than the maximum request size perhaps you should use a higher value. In addition, change linger.ms to a higher value such as 10, 50 or 100 to optimize the use of the batch and the compression. This will cause less flood in the network and optimize compression if you are using it.
There is not an exact answer on this kind of issues since they depends also on the implementation, in my case experimenting with the values above helped.