I get tweets from a Kafka topic with Avro (serializer and deserializer). Then I create a Spark consumer which extracts the tweets into a DStream of RDD[GenericRecord]. Now I want to convert each RDD into a DataFrame.
While something like the following may work for you,
val stream = ... // DStream[GenericRecord] from the Kafka consumer
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row

stream.foreachRDD { rdd =>
  // Extract each record's field values in schema order and build Rows
  val rows = rdd.map { record =>
    Row.fromSeq(record.getSchema.getFields.asScala.map(f => record.get(f.name)))
  }
  val df = ss.createDataFrame(rows, sparkSchema) // sparkSchema: StructType mirroring the Avro schema
  // df can now be queried per micro-batch
}
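For completeness, sparkSchema above is assumed to be a StructType that mirrors your Avro schema; a hypothetical two-column version could look like the following. (One gotcha: Avro strings arrive as org.apache.avro.util.Utf8, so you may need a .toString before treating field values as plain Java strings.)

import org.apache.spark.sql.types._

// Hypothetical schema for a two-field record; replace with your real columns
val sparkSchema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true)
))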
I'd like to suggest an alternative approach. With Spark 2.x you can skip creating DStreams altogether; instead, you can do something like this with Structured Streaming,
val df = ss.readStream
  .format("com.databricks.spark.avro")
  .schema(sparkSchema) // file sources in Structured Streaming require an explicit schema
  .load("/path/to/files")
This will give you a single streaming DataFrame which you can query directly. Here, ss is the SparkSession instance, and /path/to/files is the directory where all your Avro files are being dumped from Kafka.
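To actually start the stream you attach a sink to df; here is a minimal sketch that just prints each micro-batch to the console (the console sink and append mode are only illustrative):

val query = df.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()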
PS: You may need to add the spark-avro dependency,
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
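If you launch through spark-shell or spark-submit rather than an sbt build, the same library can be pulled in with --packages (the _2.11 suffix assumes a Scala 2.11 build):

spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 ...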
Hope this helped. Cheers