How to fix “java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord” in Spark Streaming Kafka Consumer?

挽巷 2020-12-10 03:57
  • Spark 2.0.0
  • Apache Kafka 0.10.1.0
  • Scala 2.11.8

When I use Spark Streaming and Kafka integration with Kafka broker version 0.10.1, I get java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord.

3 Answers
  • 2020-12-10 04:12

    KafkaUtils.createDirectStream creates an org.apache.spark.streaming.dstream.DStream, not an RDD. Spark Streaming creates RDDs transiently as it runs. To get hold of an RDD, use stream.foreachRDD() and then RDD.foreach to reach each object in the RDD. Those objects are Kafka ConsumerRecords, on which you call the value() method to read the message from the Kafka topic:

    stream.foreachRDD { rdd =>
      rdd.foreach { record =>
        val value = record.value()   // the message payload read from the Kafka topic
        println(value)               // or look it up in your own map, e.g. map.get(value)
      }
    }
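
    For context, a minimal sketch of how such a stream is typically created with the spark-streaming-kafka-0-10 integration (the broker address, group id, topic name, and batch interval below are placeholders, not values from the question):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-example"), Seconds(5))

    // Standard consumer settings; adjust to your environment.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // The elements of this DStream are ConsumerRecord[String, String], which is why
    // printing or persisting the stream directly raises NotSerializableException.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("example-topic"), kafkaParams)
    )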
    
  • 2020-12-10 04:14

    ConsumerRecord does not implement java.io.Serializable, so operations that require serialization (e.g. persist, window, or print) fail with this exception. Add the configuration below, which registers ConsumerRecord with Kryo, to avoid the error:

        sparkConf.set("spark.serializer","org.apache.spark.serializer.KryoSerialize");
        sparkConf.registerKryoClasses((Class<ConsumerRecord>[] )Arrays.asList(ConsumerRecord.class).toArray());
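
    The same configuration in Scala would be along these lines (a sketch; the app name is a placeholder):

        import org.apache.kafka.clients.consumer.ConsumerRecord
        import org.apache.spark.SparkConf

        val sparkConf = new SparkConf()
          .setAppName("kafka-example") // placeholder app name
          // Use Kryo instead of Java serialization so ConsumerRecord can be serialized.
          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
          .registerKryoClasses(Array[Class[_]](classOf[ConsumerRecord[String, String]]))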
    
  • 2020-12-10 04:25

    The ConsumerRecord objects come from the DStream. Printing them directly fails because ConsumerRecord is not serializable; instead, extract the value from each ConsumerRecord and print that.

    Instead of stream.print(), do:

    stream.map(record => record.value().toString).print()
    

    This should solve your problem.

    GOTCHA

    For anyone else seeing this exception: any call to checkpoint on the stream triggers a persist with storageLevel = MEMORY_ONLY_SER, so don't call checkpoint until after you have mapped the ConsumerRecords to serializable values.
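
    A minimal sketch of that ordering, assuming ssc.checkpoint(...) has already been set and using a one-minute interval purely as an example:

    import org.apache.spark.streaming.Minutes

    // Map to plain, serializable values first...
    val values = stream.map(record => record.value().toString)
    // ...then checkpoint (and print) the mapped DStream, not the ConsumerRecord stream.
    values.checkpoint(Minutes(1))
    values.print()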
