Apache Spark Structured Streaming (DataStreamWriter) write to Hive table

Submitted anonymously (unverified) on 2019-12-03 10:03:01

Question:

I am looking to use Spark Structured Streaming to read data from Kafka, process it, and write the results to a Hive table.

 val spark = SparkSession
   .builder
   .appName("Kafka Test")
   .config("spark.sql.streaming.metricsEnabled", true)
   .config("spark.streaming.backpressure.enabled", "true")
   .enableHiveSupport()
   .getOrCreate()

 val events = spark
   .readStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "xxxxxxx")
   .option("startingOffsets", "latest")
   .option("subscribe", "yyyyyy")
   .load

 val data = events.select(.....some columns...)

 data.writeStream
   .format("parquet")
   .option("compression", "snappy")
   .outputMode("append")
   .partitionBy("ds")
   .option("path", "maprfs:/xxxxxxx")
   .start()
   .awaitTermination()

This does create Parquet files, but how do I change it to mimic something like the following, so that it writes in a table format that can be read from Hive or spark-sql with a plain select * from query?

data.write.format("parquet").option("compression", "snappy").mode("append").partitionBy("ds").saveAsTable("xxxxxx") 
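A minimal sketch of one way to get that behavior from the streaming query itself, assuming Spark 2.4+ where DataStreamWriter.foreachBatch is available (the checkpoint path below is a placeholder): each micro-batch is written with the ordinary batch writer, so saveAsTable keeps the Hive metastore in sync and the table is queryable afterwards.

 // Sketch only: assumes Spark 2.4+ (foreachBatch) and the `data` DataFrame defined above.
 import org.apache.spark.sql.DataFrame

 data.writeStream
   .outputMode("append")
   .option("checkpointLocation", "maprfs:/checkpoints/xxxxxxx")  // placeholder checkpoint path
   .foreachBatch { (batch: DataFrame, batchId: Long) =>
     // Each micro-batch goes through the batch API, so saveAsTable registers/updates
     // the table in the Hive metastore, making `select * from xxxxxx` work.
     batch.write
       .format("parquet")
       .option("compression", "snappy")
       .mode("append")
       .partitionBy("ds")
       .saveAsTable("xxxxxx")
   }
   .start()
   .awaitTermination()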

Answer 1:

I would recommend looking at Kafka Connect for writing the data to HDFS. It is open source and available standalone or as part of Confluent Platform.
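For illustration, a rough sketch of a Confluent HDFS sink connector configuration that lands Parquet files and can optionally keep a Hive table in sync (the connector name, topic, URLs, and flush size below are placeholders, not values from this thread):

 name=hdfs-sink-yyyyyy
 connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
 tasks.max=1
 topics=yyyyyy
 hdfs.url=hdfs://namenode:8020
 flush.size=1000
 format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
 # Optional Hive integration: the connector creates/updates an external Hive table
 hive.integration=true
 hive.metastore.uris=thrift://hive-metastore:9083
 # Hive integration requires a schema compatibility mode such as BACKWARD
 schema.compatibility=BACKWARD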

For filtering and transforming the data you could use Kafka Streams or KSQL. KSQL runs on top of Kafka Streams and gives you a very simple way to join, filter, and aggregate data.

Here's an example of aggregating a stream of data in KSQL:

 SELECT PAGE_ID, COUNT(*) FROM PAGE_CLICKS WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY PAGE_ID;

See KSQL in action in this blog. You might also be interested in this talk about building streaming data pipelines with these components.


