Flink + Kafka, java.lang.OutOfMemoryError when parallelism > 1

廉价感情 · Submitted on 2019-12-24 11:22:09

Question


I have a toy Flink job which reads from 3 Kafka topics and then unions all 3 streams. That's all, no extra work.
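
Roughly, the job looks like the sketch below (a minimal sketch; the topic names, broker address, and group id are placeholders, not my real config):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

public class UnionJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2); // fine with 1, fails as soon as this is > 1

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
        props.setProperty("group.id", "my-group");             // placeholder

        // one consumer per topic, then union the three streams
        DataStream<String> s1 = env.addSource(
                new FlinkKafkaConsumer010<>("topic1", new SimpleStringSchema(), props));
        DataStream<String> s2 = env.addSource(
                new FlinkKafkaConsumer010<>("topic2", new SimpleStringSchema(), props));
        DataStream<String> s3 = env.addSource(
                new FlinkKafkaConsumer010<>("topic3", new SimpleStringSchema(), props));

        s1.union(s2, s3).print();

        env.execute("union three kafka topics");
    }
}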

With parallelism 1 my Flink job runs fine, but as soon as I change the parallelism to > 1, it fails with:

java.lang.OutOfMemoryError: Direct buffer memory
    at java.nio.Bits.reserveMemory(Bits.java:693)
    at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
    at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
    at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:241)
    at sun.nio.ch.IOUtil.read(IOUtil.java:195)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:110)
    at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:97)
    at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
    at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:169)
    at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:150)
    at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:355)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:303)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:349)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:226)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1047)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
    at org.apache.flink.streaming.connectors.kafka.internal.KafkaConsumerThread.run(KafkaConsumerThread.java:257)

How come it works with parallelism 1 but not with parallelism > 1?

Is it related to a Kafka server-side setting? Or is it related to the consumer settings in my Java code (no special config in my code yet)?

I know that the info provided here may not be sufficient, but I'm not able to touch the Kafka cluster. I just hope that some guru may have run into the same error before and can share some suggestions with me.

I'm using Kafka 0.10 and Flink 1.5.

Many thanks.


Answer 1:


As you can see in the error log, this error comes from your Kafka cluster. It occurs when the direct buffer memory of a Kafka broker exceeds the memory limit assigned to its JVM. Direct buffer memory is allocated by the JVM as the application requires it. When you use parallelism > 1, multiple Flink tasks, min(number of Flink slots, number of Kafka partitions), consume data from Kafka at the same time, which uses more of the Kafka brokers' memory than parallelism 1 does, and the error above appears.

The standard solution is to increase the heap size available to the Kafka brokers by adding the KAFKA_HEAP_OPTS variable to the Kafka env file or setting it as an OS environment variable. For example, add the following line to set the heap size to 2 GB:

export KAFKA_HEAP_OPTS="-Xms2G -Xmx2G"

But in your case, where you have no access to the Kafka brokers (according to your question), you can decrease the number of records returned in a single call to poll(), so that the memory needed on the brokers is reduced. (This is not a standard solution; I recommend it just to make the error go away.)

From this answer:

Kafka consumers handle the data backlog with the following two parameters:

max.poll.interval.ms
The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before this timeout expires, the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member. The default value is 300000 (5 minutes).

max.poll.records
The maximum number of records returned in a single call to poll(). The default value is 500.

Failing to set the above two parameters according to your requirements can lead to polling the maximum amount of data, which the consumer may not be able to handle with the available resources, causing an OutOfMemoryError or, at times, a failure to commit the consumer offsets. Hence, it is always advisable to tune the max.poll.records and max.poll.interval.ms parameters.

So, as a test, decrease the value of max.poll.records to, for example, 250 and check whether the error still happens.

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", BOOTSTRAPSERVERS);           // placeholder constants for
properties.setProperty("group.id", ID);                                  // your own connection settings
properties.setProperty("key.deserializer", KEY_DESERIALIZER_CLASS);
properties.setProperty("value.deserializer", VALUE_DESERIALIZER_CLASS);
// cap the number of records a single poll() may return
properties.setProperty("max.poll.records", "250");

// 0.10 connector, matching the Kafka 0.10 cluster from the question
FlinkKafkaConsumer010<String> myConsumer =
    new FlinkKafkaConsumer010<>("topic", new SimpleStringSchema(), properties);
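
Then wire the consumer into the job as usual (assuming a StreamExecutionEnvironment named env already exists in your job):

DataStream<String> stream = env.addSource(myConsumer);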


Source: https://stackoverflow.com/questions/55045912/flink-kafka-java-lang-outofmemoryerror-when-parallelism-1
