spark-structured-streaming

Error when connecting Spark Structured Streaming + Kafka

别说谁变了你拦得住时间么 submitted on 2021-02-11 15:45:49
Question: I'm trying to connect my Structured Streaming application (Spark 2.4.5) to Kafka, but every time I try, this Data Source Provider error appears. My Scala code and sbt build follow: import org.apache.spark.sql._ import org.apache.spark.sql.types._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.streaming.Trigger object streaming_app_demo { def main(args: Array[String]): Unit = { println("Spark Structured Streaming with Kafka Demo Application Started ...") val KAFKA_TOPIC
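
A note on the error itself: in Spark 2.4.x the Kafka source ships as a separate artifact from spark-sql, so a "Failed to find data source: kafka"-style provider error usually means spark-sql-kafka-0-10 is missing from the build. Below is a minimal sketch, not the asker's actual code: the broker address, topic name, and object name are placeholders.

// build.sbt (sketch): the Kafka source is its own artifact
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "2.4.5",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.5"
)

// Scala (sketch): read the topic and echo it to the console
import org.apache.spark.sql.SparkSession

object KafkaReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-read-sketch").master("local[*]").getOrCreate()

    val kafkaDF = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "demo-topic")                   // placeholder topic
      .load()

    kafkaDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}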

Why does a streaming query with Summarizer fail with “requirement failed: Nothing has been added to this summarizer”?

不羁岁月 submitted on 2021-02-10 18:24:01
Question: This is a follow-up to How to generate summary statistics (using Summarizer.metrics) in streaming query? I am running a Python script to generate summary statistics for the micro-batches of a streaming query. Python code (currently running): import sys import json import psycopg2 from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.sql import SparkSession from pyspark.sql.types import StructType, StructField, StringType, IntegerType from pyspark.sql
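
For context, Summarizer raises "requirement failed: Nothing has been added to this summarizer" when it is asked to aggregate zero rows, which can happen on an empty micro-batch. A minimal sketch of guarding against that inside foreachBatch follows; it is written in Scala rather than the asker's Python, and assumes a hypothetical streaming DataFrame streamingDF with a Vector column named "features".

import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.DataFrame

// Process each micro-batch; skip empty ones so Summarizer never sees zero rows.
val summarizeBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
  if (!batch.isEmpty) {
    batch
      .select(Summarizer.metrics("mean", "count").summary(batch("features")).alias("stats"))
      .show(truncate = false)
  }
}

val query = streamingDF.writeStream
  .foreachBatch(summarizeBatch)
  .start()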

Getting error saying “Queries with streaming sources must be executed with writeStream.start()” on Spark Structured Streaming [duplicate]

拟墨画扇 submitted on 2021-02-08 12:00:31
Question: This question already has answers here: How to display a streaming DataFrame (as show fails with AnalysisException)? (2 answers) Closed 2 years ago. I am running into issues while executing Spark SQL on top of Spark Structured Streaming. PFA for the error. Here is my code: object sparkSqlIntegration { def main(args: Array[String]) { val spark = SparkSession .builder .appName("StructuredStreaming") .master("local[*]") .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work
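
For context, eager actions such as show() or count() cannot be called directly on a streaming DataFrame; the query has to be materialized through writeStream and started. A minimal sketch of the difference, using the built-in rate source purely as a stand-in for the asker's source:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("writestream-sketch").master("local[*]").getOrCreate()

// A streaming DataFrame (the rate source generates dummy rows for demos).
val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

// stream.show()  // would throw: "Queries with streaming sources must be executed with writeStream.start()"

// Attach a sink and start the query instead:
val query = stream.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()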

PySpark Structured Streaming: write to Parquet in batches

牧云@^-^@ submitted on 2021-02-08 09:51:55
Question: I am doing some transformations on a Spark Structured Streaming dataframe. I am storing the transformed dataframe as parquet files in HDFS. Now I want the write to HDFS to happen in batches, instead of transforming the whole dataframe first and then storing it. Answer 1: Here is a parquet sink example: # parquet sink example targetParquetHDFS = sourceTopicKAFKA .writeStream .format("parquet") # can be "orc", "json", "csv", etc. .outputMode("append") # can only be "append"
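
The answer's point, sketched below in Scala (the question itself is PySpark, but the API has the same shape): Structured Streaming already writes one micro-batch at a time, so a parquet sink plus a processing-time trigger controls how often a batch lands in HDFS. The paths, trigger interval, and the transformedDF name are placeholders.

import org.apache.spark.sql.streaming.Trigger

// transformedDF is assumed to be the already-transformed streaming DataFrame.
val query = transformedDF.writeStream
  .format("parquet")                        // could also be "orc", "json", "csv", ...
  .outputMode("append")                     // file sinks only support append mode
  .option("path", "hdfs:///data/output")                            // placeholder output dir
  .option("checkpointLocation", "hdfs:///data/output_checkpoint")   // required for file sinks
  .trigger(Trigger.ProcessingTime("1 minute"))  // one parquet write per micro-batch
  .start()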

How to deduplicate and keep the latest record based on a timestamp field in Spark Structured Streaming?

天大地大妈咪最大 submitted on 2021-02-08 08:44:17
Question: Spark dropDuplicates keeps the first instance and ignores all subsequent occurrences for that key. Is it possible to remove duplicates while keeping the most recent occurrence? For example, if below are the micro-batches that I get, then I want to keep the most recent record (sorted on the timestamp field) for each country. batchId: 0 Australia, 10, 2020-05-05 00:00:06 Belarus, 10, 2020-05-05 00:00:06 batchId: 1 Australia, 10, 2020-05-05 00:00:08 Belarus, 10, 2020-05-05 00:00:03 Then output
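
One common way to keep only the newest row per key within each micro-batch is foreachBatch plus a window/row_number ranking, since dropDuplicates keeps whichever row it sees first. A minimal Scala sketch follows, assuming hypothetical column names country, cnt, ts and a streaming DataFrame streamingDF; keeping the latest record across micro-batches would additionally need stateful processing such as flatMapGroupsWithState.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// For each country, rank rows by timestamp (newest first) and keep only rank 1.
val keepLatest: (DataFrame, Long) => Unit = (batch, batchId) => {
  val w = Window.partitionBy("country").orderBy(col("ts").desc)
  batch
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") === 1)
    .drop("rn")
    .write.mode("append").parquet("/tmp/dedup-output")   // placeholder sink
}

streamingDF.writeStream
  .foreachBatch(keepLatest)
  .start()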

Does the state also get removed on event timeout with Spark Structured Streaming?

試著忘記壹切 submitted on 2021-02-05 09:26:35
Question: Does the state get timed out and removed at the same time, or does it only time out while the state data still remains, for both ProcessingTimeout and EventTimeout? I was experimenting with mapGroupsWithState/flatMapGroupsWithState and was confused about the state timeout. Suppose I am maintaining state with a watermark of 10 seconds and applying a timeout based on event time, say: ds.withWatermark("timestamp", "10 seconds") .groupByKey(...) .mapGroupsWithState(
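
For reference, when a timeout fires the group function is invoked once more with an empty iterator and state.hasTimedOut == true, and the state object is only dropped if the function calls state.remove(). A minimal Scala sketch of that handling with an event-time timeout follows; the Event/SessionState classes and the 10-second expiry policy are hypothetical, and ds is assumed to be a Dataset[Event] (with spark.implicits._ in scope).

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(key: String, timestamp: java.sql.Timestamp, value: Long)
case class SessionState(total: Long)

def updateState(key: String, events: Iterator[Event], state: GroupState[SessionState]): String = {
  if (state.hasTimedOut) {
    // The timeout only marks the group as timed out; removing the state is up to us.
    val summary = s"$key expired with total=${state.getOption.map(_.total).getOrElse(0L)}"
    state.remove()
    summary
  } else {
    val total = state.getOption.map(_.total).getOrElse(0L) + events.map(_.value).sum
    state.update(SessionState(total))
    // Expire this key once the watermark moves 10 seconds past now (hypothetical policy).
    state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 10000L)
    s"$key running total=$total"
  }
}

val updates = ds
  .withWatermark("timestamp", "10 seconds")
  .groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(updateState _)

// mapGroupsWithState requires the Update output mode.
updates.writeStream.outputMode("update").format("console").start()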

Reading the schema of a streaming DataFrame in Spark Structured Streaming [duplicate]

南笙酒味 submitted on 2021-02-04 21:05:17
Question: This question already has an answer here: Why does using cache on streaming Datasets fail with “AnalysisException: Queries with streaming sources must be executed with writeStream.start()”? (1 answer) Closed 13 days ago. I'm new to Apache Spark Structured Streaming. I'm trying to read some events from Event Hub (in XML format) and create a new Spark DF from the nested XML. I'm using the code example described at https://github.com/databricks/spark-xml, and in batch mode it is running
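
One pattern that avoids schema inference on the stream itself is to infer the nested schema from a small static sample in batch mode, then parse each streaming row with spark-xml's from_xml column function. The sketch below assumes a recent spark-xml version on the classpath, an Event Hubs stream whose XML payload sits in a body column, and placeholder paths and row tags; the exact from_xml signature should be checked against the spark-xml version in use.

import com.databricks.spark.xml.functions.from_xml
import org.apache.spark.sql.functions.col

// 1) Infer the nested schema once, in batch mode, from a representative sample file.
val sampleSchema = spark.read
  .format("xml")
  .option("rowTag", "event")                    // placeholder row tag
  .load("/mnt/samples/event-sample.xml")        // placeholder path
  .schema

// 2) Apply that schema to the streaming payload column (eventHubStream is assumed).
val parsed = eventHubStream
  .select(col("body").cast("string").as("xml"))
  .select(from_xml(col("xml"), sampleSchema).as("event"))
  .select("event.*")

parsed.writeStream.format("console").start()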

How to stream data from a Kafka topic to a Delta table using Spark Structured Streaming

纵饮孤独 submitted on 2021-02-04 18:09:05
Question: I'm trying to understand Databricks Delta and thinking of doing a POC using Kafka. Basically the plan is to consume data from Kafka and insert it into the Databricks Delta table. These are the steps I took: Create a Delta table on Databricks. %sql CREATE TABLE hazriq_delta_trial2 ( value STRING ) USING delta LOCATION '/delta/hazriq_delta_trial2' Consume data from Kafka. import org.apache.spark.sql.types._ val kafkaBrokers = "broker1:port,broker2:port,broker3:port" val kafkaTopic = "kafkapoc"
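
To complete the picture, the usual shape of the remaining steps is a Kafka readStream followed by a Delta-format writeStream into the table's location, with a checkpoint. A sketch reusing the question's kafkaBrokers/kafkaTopic variables and table path; the startingOffsets choice and checkpoint directory are assumptions, not taken from the question.

import org.apache.spark.sql.functions.col

val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("subscribe", kafkaTopic)
  .option("startingOffsets", "earliest")   // assumption; pick what fits the POC
  .load()

// The Delta table has a single STRING column named `value`.
val toWrite = kafkaDF.select(col("value").cast("string").as("value"))

toWrite.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/delta/hazriq_delta_trial2/_checkpoints/kafka")  // placeholder
  .start("/delta/hazriq_delta_trial2")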