Apache Spark Structured Streaming (DataStreamWriter) write to Hive table

Submitted anonymously (unverified) on 2019-12-03 10:03:01

Question:

I am looking to use Spark Structured Streaming to read data from Kafka, process it, and write the results to a Hive table.

 val spark = SparkSession
   .builder
   .appName("Kafka Test")
   .config("spark.sql.streaming.metricsEnabled", true)
   .config("spark.streaming.backpressure.enabled", "true")
   .enableHiveSupport()
   .getOrCreate()

 val events = spark
   .readStream
   .format("kafka")
   .option("kafka.bootstrap.servers", "xxxxxxx")
   .option("startingOffsets", "latest")
   .option("subscribe", "yyyyyy")
   .load

 val data = events.select(.....some columns...)

 data.writeStream
   .format("parquet")
   .option("compression", "snappy")
   .outputMode("append")
   .partitionBy("ds")
   .option("path", "maprfs:/xxxxxxx")
   .start()
   .awaitTermination()

This does create Parquet files, but how do I change it to mimic something like the following, so that it writes in a table format that can be read from Hive or spark-sql with a plain select * from query?

data.write.format("parquet").option("compression", "snappy").mode("append").partitionBy("ds").saveAsTable("xxxxxx") 
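A minimal sketch of one way to get that behavior from the streaming query itself, assuming Spark 2.4+ where DataStreamWriter.foreachBatch is available (the checkpoint path below is a placeholder): each micro-batch is written with the ordinary batch writer, so saveAsTable keeps the Hive metastore in sync and the table is queryable afterwards.

 // Sketch only: assumes Spark 2.4+ (foreachBatch) and the `data` DataFrame defined above.
 import org.apache.spark.sql.DataFrame

 data.writeStream
   .outputMode("append")
   .option("checkpointLocation", "maprfs:/checkpoints/xxxxxxx")  // placeholder checkpoint path
   .foreachBatch { (batch: DataFrame, batchId: Long) =>
     // Each micro-batch goes through the batch API, so saveAsTable registers/updates
     // the table in the Hive metastore, making `select * from xxxxxx` work.
     batch.write
       .format("parquet")
       .option("compression", "snappy")
       .mode("append")
       .partitionBy("ds")
       .saveAsTable("xxxxxx")
   }
   .start()
   .awaitTermination()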

Answer 1:

I would recommend looking at Kafka Connect for writing the data to HDFS. It is open source and available standalone or as part of Confluent Platform.
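For illustration, a rough sketch of a Confluent HDFS sink connector configuration that lands Parquet files and can optionally keep a Hive table in sync (the connector name, topic, URLs, and flush size below are placeholders, not values from this thread):

 name=hdfs-sink-yyyyyy
 connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
 tasks.max=1
 topics=yyyyyy
 hdfs.url=hdfs://namenode:8020
 flush.size=1000
 format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
 # Optional Hive integration: the connector creates/updates an external Hive table
 hive.integration=true
 hive.metastore.uris=thrift://hive-metastore:9083
 # Hive integration requires a schema compatibility mode such as BACKWARD
 schema.compatibility=BACKWARD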

For filtering and transforming the data you could use Kafka Streams or KSQL. KSQL runs on top of Kafka Streams and gives you a very simple way to join, filter, and aggregate data.

Here's an example of aggregating a stream of data in KSQL:

 SELECT PAGE_ID, COUNT(*) FROM PAGE_CLICKS WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY PAGE_ID;

See KSQL in action in this blog. You might also be interested in this talk about building streaming data pipelines with these components.


