I am looking to use Spark Structured Streaming to read data from Kafka, process it, and write it to a Hive table.
val spark = SparkSession
  .builder
  .appName("Kafka Test")
  .config("spark.sql.streaming.metricsEnabled", true)
  .config("spark.streaming.backpressure.enabled", "true")
  .enableHiveSupport()
  .getOrCreate()

val events = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxxxxxx")
  .option("startingOffsets", "latest")
  .option("subscribe", "yyyyyy")
  .load

val data = events.select(.....some columns...)

data.writeStream
  .format("parquet")
  .option("compression", "snappy")
  .outputMode("append")
  .partitionBy("ds")
  .option("path", "maprfs:/xxxxxxx")
  .start()
  .awaitTermination()
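At the moment the only way I know to query this output is to point Spark at the path directly, roughly as in the sketch below (the path is the same placeholder as above, and events_parquet is just a temp view name I made up):

val df = spark.read.parquet("maprfs:/xxxxxxx")  // same placeholder output path as above
df.createOrReplaceTempView("events_parquet")    // temporary view, not a Hive table
spark.sql("select * from events_parquet").show()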
This does create Parquet files. However, how do I change it to mimic something like the call below, so that the output is written as a table that can be queried from Hive or spark-sql with a plain select * from?
data.write
  .format("parquet")
  .option("compression", "snappy")
  .mode("append")
  .partitionBy("ds")
  .saveAsTable("xxxxxx")
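One idea I have been considering is foreachBatch (available since Spark 2.4), which would let me reuse the batch writer for every micro-batch. This is only a sketch of what I have in mind, not something I have confirmed in my setup; my_db.my_table and the checkpoint path are placeholders I made up:

import org.apache.spark.sql.DataFrame

data.writeStream
  .outputMode("append")
  .option("checkpointLocation", "maprfs:/path/to/checkpoint")  // placeholder checkpoint path
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Reuse the regular batch writer so each micro-batch lands in a metastore table
    batchDF.write
      .format("parquet")
      .option("compression", "snappy")
      .mode("append")
      .partitionBy("ds")
      .saveAsTable("my_db.my_table")  // placeholder table name
  }
  .start()
  .awaitTermination()

Is something like this the right approach, or is it more common to simply create an external Hive table on top of the streaming output path and register the partitions there?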