问题
I have a Dataframe in Spark that looks like
eventDF
Sno|UserID|TypeExp
1|JAS123|MOVIE
2|ASP123|GAMES
3|JAS123|CLOTHING
4|DPS123|MOVIE
5|DPS123|CLOTHING
6|ASP123|MEDICAL
7|JAS123|OTH
8|POQ133|MEDICAL
.......
10000|DPS123|OTH
I need to write it to Kafka topic in Avro format currently i am able to write in Kafka as JSON using following code
val kafkaUserDF: DataFrame = eventDF.select(to_json(struct(eventDF.columns.map(column):_*)).alias("value"))
kafkaUserDF.selectExpr("CAST(value AS STRING)").write.format("kafka")
.option("kafka.bootstrap.servers", "Host:port")
.option("topic", "eventdf")
.save()
Now I want to write this in Avro format to Kafka topic
回答1:
Spark >= 2.4:
You can use to_avro function from spark-avro library.
import org.apache.spark.sql.avro._
eventDF.select(
to_avro(struct(eventDF.columns.map(column):_*)).alias("value")
)
Spark < 2.4
You have to do it the same way:
Create a function which writes serialized Avro record to
ByteArrayOutputStream
and return the result. A naive implementation (this supports only flat objects) could be similar to (adopted from Kafka Avro Scala Example by Sushil Kumar Singh)import org.apache.spark.sql.Row def encode(schema: org.apache.avro.Schema)(row: Row): Array[Byte] = { val gr: GenericRecord = new GenericData.Record(schema) row.schema.fieldNames.foreach(name => gr.put(name, row.getAs(name))) val writer = new SpecificDatumWriter[GenericRecord](schema) val out = new ByteArrayOutputStream() val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null) writer.write(gr, encoder) encoder.flush() out.close() out.toByteArray() }
Convert it to
udf
:import org.apache.spark.sql.functions.udf val schema: org.apache.avro.Schema val encodeUDF = udf(encode(schema) _)
Use it as drop in replacement for
to_json
eventDF.select( encodeUDF(struct(eventDF.columns.map(column):_*)).alias("value") )
来源:https://stackoverflow.com/questions/47951668/spark-dataframe-write-to-kafka-topic-in-avro-format