Use schema to convert AVRO messages with Spark to DataFrame

Submitted by 故事扮演 on 2019-12-03 05:46:17
Tal Joffe

The OP has probably resolved the issue by now, but for future reference: I solved it in a fairly general way, so I thought it might be helpful to post here.

Generally speaking, you should convert the Avro schema to a Spark StructType, convert each object in your RDD to a Row, and then use:

spark.createDataFrame(<RDD[obj] mapped to RDD[Row]>, <schema as StructType>)

To convert the Avro schema I used spark-avro, like so:

SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
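
As a rough sketch of that conversion step in context: the snippet below parses a small Avro schema and turns it into a StructType. The two-field User schema is made up for illustration. It also assumes Spark 2.4 or later with the built-in spark-avro module on the classpath, where SchemaConverters lives in org.apache.spark.sql.avro; with the older Databricks package the import would be com.databricks.spark.avro.SchemaConverters instead.

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

object SchemaConversionSketch {
  // Hypothetical two-field schema, used only for illustration.
  val avroSchemaJson: String =
    """{
      |  "type": "record",
      |  "name": "User",
      |  "fields": [
      |    {"name": "name", "type": "string"},
      |    {"name": "age",  "type": "int"}
      |  ]
      |}""".stripMargin

  val avroSchema: Schema = new Schema.Parser().parse(avroSchemaJson)

  // toSqlType returns a SchemaType(dataType, nullable);
  // for an Avro record the dataType is a StructType.
  val sparkSchema: StructType =
    SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
}
```

The cast to StructType only holds when the top-level Avro type is a record, which is the usual case for message schemas.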

The conversion of the RDD was trickier. If your schema is simple, you can probably just do a simple map, something like this:

rdd.map(obj => {
    // Build the values in the same order as the fields of the StructType
    val seq = Seq(obj.getName(), obj.getAge())
    Row.fromSeq(seq)
})

In this example the object has two fields, name and age.

The important thing is to make sure the elements in the Row will match the order and types of the fields in the StructType from before.
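
To make the ordering requirement concrete, here is a minimal end-to-end sketch. The Person class, the hand-written schema and the local SparkSession are all assumptions for illustration; in the flow described above the StructType would come from the Avro schema rather than being written by hand.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SimpleRowMapping {
  // Hypothetical stand-in for the object stored in the RDD.
  class Person(name: String, age: Int) extends Serializable {
    def getName(): String = name
    def getAge(): Int = age
  }

  // Written by hand here; field order is name, then age.
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = false)
  ))

  def buildDataFrame(spark: SparkSession): DataFrame = {
    val rdd = spark.sparkContext.parallelize(
      Seq(new Person("Ann", 31), new Person("Bob", 45)))

    // Each Row lists its values in the schema's field order and with
    // matching types: a String for name, an Int for age.
    val rows = rdd.map(p => Row(p.getName(), p.getAge()))

    spark.createDataFrame(rows, schema)
  }
}
```

Swapping the two values in the Row would not fail when the DataFrame is created; the mismatch typically shows up only later, as a runtime error when the rows are evaluated.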

In my particular case I had a much more complex object, which I wanted to handle generically to support future schema changes, so my code was much more complex.

The method suggested by the OP should also work in some cases, but it will be hard to apply to complex objects (anything that is not a primitive or a case class).

Another tip: if you have a class within a class, you should convert the inner class to a Row as well, so that the wrapping class is converted to something like:

Row(Any,Any,Any,Row,...)
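
One way to get that nested shape generically is to walk an Avro GenericRecord recursively. The sketch below is an illustration only, not the spark-avro logic or the code from this answer. It assumes the objects are GenericRecords, and it handles only nested records, strings and arrays; maps, enums, fixed, bytes and logical types would need their own cases. The AvroToRow object and its method names are made up for this example.

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import scala.collection.JavaConverters._

object AvroToRow {
  // Recursively convert an Avro value into a value Spark can store in a Row.
  def toSparkValue(value: Any): Any = value match {
    case null                 => null
    case r: GenericRecord     => recordToRow(r)   // nested record becomes a nested Row
    case s: CharSequence      => s.toString       // Avro Utf8 becomes a plain String
    case l: java.util.List[_] => l.asScala.map(toSparkValue).toList
    case other                => other            // numbers and booleans pass through
  }

  // Values are emitted in schema field order, so the resulting Row lines up
  // with the StructType derived from the same Avro schema.
  def recordToRow(record: GenericRecord): Row = {
    val fields = record.getSchema.getFields.asScala
    Row.fromSeq(fields.map(f => toSparkValue(record.get(f.pos()))).toList)
  }
}
```

Because the field order is taken from the record's own schema, new fields added to the schema flow through without code changes, as long as their types are among those handled.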

You can also look at the spark-avro project I mentioned earlier to see how it converts objects to rows. I used some of the logic there myself.

If someone reading this needs further help, ask me in the comments and I'll try to help.

A similar problem is also solved here.

Please take a look at this: https://github.com/databricks/spark-avro/blob/master/src/test/scala/com/databricks/spark/avro/AvroSuite.scala

So instead of

val df = rdd.map(message => Injection.injection.invert(message._2).get)
    .map(record => User(record.get("firstName").toString, record.get("lastName").toString))
    .toDF()

you can try this:

val df = spark.read.avro(message._2.get)
RadioLog

I worked on a similar issue, but in Java, so I'm not sure about Scala. Take a look at the library com.databricks.spark.avro.

Ben

For anyone interested in doing this in a way that copes with schema changes without needing to stop and redeploy your Spark application (assuming your app logic can handle this), see this question/answer.
