How to convert RDD[GenericRecord] to dataframe in scala?

Asked by 轮回少年 on 2020-12-11 12:19

I get tweets from a Kafka topic with Avro (serializer and deserializer). Then I create a Spark consumer which extracts the tweets into a DStream of RDD[GenericRecord]. Now I want to convert each RDD to a DataFrame. How can I do that in Scala?
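
For context, a minimal sketch of such a consumer, assuming the Confluent KafkaAvroDeserializer and the spark-streaming-kafka-0-10 integration (the broker address, topic name, Schema Registry URL, and the ssc streaming context are placeholders/assumptions, not from the original post):

    import org.apache.avro.generic.GenericRecord
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",          // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> "io.confluent.kafka.serializers.KafkaAvroDeserializer",
      "schema.registry.url" -> "http://localhost:8081", // placeholder registry
      "group.id" -> "tweet-consumer"
    )

    // DStream whose underlying RDDs hold GenericRecord values
    val stream = KafkaUtils.createDirectStream[String, GenericRecord](
      ssc, PreferConsistent, Subscribe[String, GenericRecord](Seq("tweets"), kafkaParams)
    )
    val tweetStream = stream.map(_.value())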

4 Answers
  •  醉酒成梦 · 2020-12-11 12:51

    A combination of https://stackoverflow.com/a/48828303/5957143 and https://stackoverflow.com/a/47267060/5957143 works for me.

    I used the following to create MySchemaConversions:

    package com.databricks.spark.avro
    
    import org.apache.avro.Schema
    import org.apache.avro.generic.GenericRecord
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.DataType
    
    // This object must live in the com.databricks.spark.avro package:
    // SchemaConverters.createConverterToSQL is private to that package,
    // so this thin wrapper re-exposes it as a GenericRecord => Row function.
    object MySchemaConversions {
      def createConverterToSQL(avroSchema: Schema, sparkSchema: DataType): (GenericRecord) => Row =
        SchemaConverters.createConverterToSQL(avroSchema, sparkSchema).asInstanceOf[(GenericRecord) => Row]
    }
    

    And then I used:

    import org.apache.spark.sql.types.StructType
    
    val myAvroType = SchemaConverters.toSqlType(schema).dataType
    val myAvroRecordConverter = MySchemaConversions.createConverterToSQL(schema, myAvroType)
    
    // unionedResultRdd is an RDD[GenericRecord] (a union of the per-batch RDDs)
    val rowRDD = unionedResultRdd.map(record => MyObject.myConverter(record, myAvroRecordConverter))
    
    // toSqlType on an Avro record schema yields a StructType, so the cast is safe
    val df = sparkSession.createDataFrame(rowRDD, myAvroType.asInstanceOf[StructType])
    

    The advantage of placing myConverter inside the object MyObject is that Spark does not have to serialize an enclosing class when it ships the map closure to executors, so you avoid serialization issues (java.io.NotSerializableException).

    object MyObject {
      // Apply the converter to a single record; referencing a method on a
      // top-level object keeps the closure free of non-serializable state
      def myConverter(record: GenericRecord,
                      myAvroRecordConverter: (GenericRecord) => Row): Row =
        myAvroRecordConverter.apply(record)
    }
    
