Question
I'm trying to write a DataFrame with about 230 million records to Kafka, more specifically to a Kafka-enabled Azure Event Hub, but I'm not sure whether that's actually the source of my issue.
EH_SASL = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'
dfKafka \
.write \
.format("kafka") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
.option("topic", "mytopic") \
.option("checkpointLocation", "/mnt/telemetry/cp.txt") \
.save()
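For reference, Spark's Kafka sink expects the DataFrame to contain a string or binary column named value (and optionally key and topic). Below is a minimal sketch of how dfKafka might be prepared before the write above; the source DataFrame dfSource and its deviceId column are assumptions for illustration, not from the original post:

from pyspark.sql.functions import to_json, struct, col

# Hypothetical preparation step: serialize each row as JSON into "value"
# and use an assumed "deviceId" column as the Kafka message key.
dfKafka = dfSource.select(
    col("deviceId").cast("string").alias("key"),
    to_json(struct(*dfSource.columns)).alias("value")
)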
The write starts up fine and writes about 3-4 million records to the queue successfully (and pretty fast). But then the job stops after a couple of minutes with messages like these:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 4 times, most recent failure: Lost task 6.3 in stage 7.0 (TID 248, 10.139.64.5, executor 1): kafkashaded.org.apache.kafka.common.errors.TimeoutException: Expiring 61 record(s) for mytopic-18: 32839 ms has passed since last append
or
org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 8.0 failed 4 times, most recent failure: Lost task 13.3 in stage 8.0 (TID 348, 10.139.64.5, executor 1): kafkashaded.org.apache.kafka.common.errors.TimeoutException: The request timed out.
Also, I never see the checkpoint file being created/written to.
I also played around with .option("kafka.delivery.timeout.ms", 30000) and various other values, but that didn't seem to have any effect.
I'm running this on an Azure Databricks cluster, runtime version 5.0 (includes Apache Spark 2.4.0, Scala 2.11).
I don't see any errors such as throttling on my Event Hub, so that should be fine.
Answer 1:
Finally figured it out (mostly):
It turns out the default batch size of about 16,000 was too large for the endpoint. After I set the batch.size parameter to 5000, it worked and is now writing about 700k messages per minute to the Event Hub. Also, the timeout parameter above was wrong and was simply being ignored; the correct option is kafka.request.timeout.ms.
The only remaining issue is that it still randomly runs into timeouts and apparently starts over from the beginning, leaving me with duplicates. I'll open another question for that.
dfKafka \
.write \
.format("kafka") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.batch.size", 5000) \
.option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
.option("kafka.request.timeout.ms", 120000) \
.option("topic", "raw") \
.option("checkpointLocation", "/mnt/telemetry/cp.txt") \
.save()
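Any option prefixed with kafka. is passed through to the underlying (shaded) Kafka producer, so further producer tuning can be expressed the same way. A hedged sketch that additionally sets linger.ms and retries alongside the settings above; the values are illustrative assumptions and were not verified against this workload:

# Sketch only: kafka.linger.ms and kafka.retries are illustrative additions
# forwarded to the Kafka producer; the chosen values are assumptions.
dfKafka \
.write \
.format("kafka") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
.option("kafka.batch.size", 5000) \
.option("kafka.request.timeout.ms", 120000) \
.option("kafka.linger.ms", 100) \
.option("kafka.retries", 5) \
.option("topic", "raw") \
.save()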
Source: https://stackoverflow.com/questions/53765133/writing-large-dataframe-from-pyspark-to-kafka-runs-into-timeout