How to read parquet files using `ssc.fileStream()`? What are the types passed to `ssc.fileStream()`?

执念已碎 (asked 2020-12-16 05:31)

My understanding of Spark's fileStream() method is that it takes three types as parameters: Key, Value, and Format. In c
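For reference, the overload in question looks roughly like this (a sketch based on the StreamingContext API; parameter names are approximate):

    // Simplified sketch of the fileStream signature in StreamingContext.
    // K = key type, V = value type, F = the Hadoop (new API) InputFormat.
    def fileStream[K, V, F <: NewInputFormat[K, V]](
        directory: String,
        filter: Path => Boolean,
        newFilesOnly: Boolean,
        conf: Configuration): InputDStream[(K, V)]

The three type parameters must be consistent: F must be an InputFormat that produces (K, V) pairs.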

2 Answers
  •  隐瞒了意图╮ (answered 2020-12-16 05:58)

    My sample code for reading parquet files in Spark Streaming is below.

    // Assumed imports for this snippet:
    import org.apache.avro.generic.GenericRecord
    import org.apache.hadoop.fs.Path
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import parquet.hadoop.ParquetInputFormat

    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // Tell ParquetInputFormat to materialize records as Avro GenericRecords.
    ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class", "parquet.avro.AvroReadSupport")
    // Only pick up files ending in "parquet"; pass the Hadoop Configuration explicitly.
    val stream = ssc.fileStream[Void, GenericRecord, ParquetInputFormat[GenericRecord]](
      directory, { path: Path => path.toString.endsWith("parquet") }, true, ssc.sparkContext.hadoopConfiguration)

    val lines = stream.map { row =>
      println("row:" + row.toString())
      row
    }
    

    Key points:

    • the record type is GenericRecord
    • the read-support class is AvroReadSupport
    • pass a Configuration to fileStream
    • set parquet.read.support.class on that Configuration
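    Once the stream is set up, each element is a (Void, GenericRecord) pair, so a typical next step is to pull fields out of the record. A minimal sketch, assuming the stream above and an Avro schema with a field named "name" (the field name is hypothetical; substitute one from your own schema):

        // Extract a single field from each GenericRecord.
        // The key is Void, so we ignore it in the pattern match.
        val names = stream.map { case (_, record) =>
          // "name" is an assumed field in the Avro schema, not from the original post.
          record.get("name").toString
        }
        names.print()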

    I referred to the source code below when creating this sample.
    I could not find good examples either, so I will wait for a better answer.

    https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
    https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
    https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
