Kafka Producer NetworkException and Timeout Exceptions

我在风中等你 2020-12-14 20:53

We are getting random NetworkExceptions and TimeoutExceptions in our production environment:

Brok         


        
4 Answers
  •  太阳男子
    2020-12-14 21:28

    We faced a similar problem: many NetworkExceptions in the logs and, from time to time, a TimeoutException.

    Cause

    Once we gathered TCP logs from production, it turned out that some of the TCP connections to the Kafka brokers (we have 3 broker nodes) were being dropped after about 5 minutes of idleness without the clients being notified (no FIN flag at the TCP layer). When a client tried to reuse such a connection after that time, it received an RST. We could easily match those connection resets in the TCP logs with the NetworkExceptions in the application logs.

    As for the TimeoutException, we could not do the same matching, because by the time we found the cause this type of error was no longer occurring. However, we confirmed in a separate test that dropping a TCP connection can also result in a TimeoutException. I guess this is because the Java Kafka client uses a Java NIO SocketChannel under the hood: all messages are buffered and then dispatched once the connection is ready. If the connection is not ready within the timeout (30 seconds), the buffered messages expire, resulting in a TimeoutException.

    Solution

    For us, the fix was to reduce connections.max.idle.ms on our clients to 4 minutes (240000 ms). Once we applied it, the NetworkExceptions were gone from our logs.
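As a minimal sketch of that setting, the snippet below builds producer properties with connections.max.idle.ms set to 4 minutes. The class name, the helper method, and the broker address broker1:9092 are placeholders made up for illustration and are not from the original post. Plain string keys are used so the example compiles without the Kafka client library on the classpath; in a real application the resulting Properties would be passed to the KafkaProducer constructor along with the serializer settings.

```java
import java.time.Duration;
import java.util.Properties;

public class ProducerIdleConfig {

    // Hypothetical helper: returns producer properties with a custom idle timeout.
    static Properties producerProps(String bootstrapServers, Duration maxIdle) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Close idle connections on the client side after maxIdle,
        // so the client never reuses a connection that was silently dropped.
        props.setProperty("connections.max.idle.ms", Long.toString(maxIdle.toMillis()));
        return props;
    }

    public static void main(String[] args) {
        // "broker1:9092" is a placeholder address, not taken from the thread.
        Properties props = producerProps("broker1:9092", Duration.ofMinutes(4));
        System.out.println(props.getProperty("connections.max.idle.ms")); // prints 240000
    }
}
```

The value is written in milliseconds because that is the unit the setting expects; deriving it from a Duration avoids miscounting zeros.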

    We are still investigating what is dropping the connections.

    Edit

    The cause of the problem was the AWS NAT Gateway, which drops idle outgoing connections after 350 seconds.

    https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-troubleshooting.html#nat-gateway-troubleshooting-timeout
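The numbers in this answer can be checked against each other with a small sketch. It assumes the Kafka client default for connections.max.idle.ms is 540000 ms (9 minutes), which is what the Kafka documentation lists for producers; verify this for your client version. The class and method names are hypothetical.

```java
import java.time.Duration;

public class IdleTimeoutCheck {

    // True when the client closes an idle connection before the
    // intermediary (for example a NAT gateway) silently drops it.
    static boolean closesBeforeIntermediary(Duration clientMaxIdle, Duration intermediaryIdleTimeout) {
        return clientMaxIdle.compareTo(intermediaryIdleTimeout) < 0;
    }

    public static void main(String[] args) {
        Duration natIdleTimeout = Duration.ofSeconds(350);   // from the AWS NAT Gateway docs linked above
        Duration kafkaDefault = Duration.ofMillis(540_000);  // assumed client default, check your version
        Duration appliedFix = Duration.ofMinutes(4);         // value used in this answer

        System.out.println(closesBeforeIntermediary(kafkaDefault, natIdleTimeout)); // prints false
        System.out.println(closesBeforeIntermediary(appliedFix, natIdleTimeout));   // prints true
    }
}
```

With the assumed default, the client keeps idle connections longer than the gateway does, which is consistent with the resets described above; with 4 minutes, the client closes them first.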
