My understanding of Spark\'s fileStream() method is that it takes three types as parameters: Key, Value, and Format. In c
You can access the parquet by adding some parquet specific hadoop settings :
val ssc = new StreamingContext(conf, Seconds(5))
var schema =StructType(Seq(
StructField("a", StringType, nullable = false),
........
))
val schemaJson=schema.json
val fileDir="/tmp/fileDir"
ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class", "org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport") ssc.sparkContext.hadoopConfiguration.set("org.apache.spark.sql.parquet.row.requested_schema", schemaJson)
ssc.sparkContext.hadoopConfiguration.set(SQLConf.PARQUET_BINARY_AS_STRING.key, "false")
ssc.sparkContext.hadoopConfiguration.set(SQLConf.PARQUET_INT96_AS_TIMESTAMP.key, "false")
ssc.sparkContext.hadoopConfiguration.set(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key, "false")
ssc.sparkContext.hadoopConfiguration.set(SQLConf.PARQUET_BINARY_AS_STRING.key, "false")
val streamRdd = ssc.fileStream[Void, UnsafeRow, ParquetInputFormat[UnsafeRow]](fileDir,(t: org.apache.hadoop.fs.Path) => true, false)
streamRdd.count().print()
ssc.start()
ssc.awaitTermination()
This code was prepared with Spark 2.1.0.