spark-structured-streaming

Why does a streaming Dataset fail with “Complete output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets…”?

Submitted by 时光总嘲笑我的痴心妄想 on 2020-06-17 08:48:27
Question: I use Spark 2.2.0 and get the following error with Spark Structured Streaming on Windows: "Complete output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark". Answer 1: Streaming aggregations require that you tell the Spark Structured Streaming engine when to output the aggregation (per the so-called output mode), since …
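The usual remedy the answer is pointing toward is to declare a watermark on the event-time column before aggregating. A minimal sketch, assuming a streaming DataFrame `df` with an event-time column named `timestamp` (both names are hypothetical here, not from the question):

```scala
import org.apache.spark.sql.functions.{window, col}

// df is a streaming DataFrame with an event-time column "timestamp" (assumed).
val counts = df
  .withWatermark("timestamp", "10 minutes")            // tell the engine how late data may arrive
  .groupBy(window(col("timestamp"), "5 minutes"), col("word"))
  .count()

// With a watermark in place, the engine can finalize and emit windows
// incrementally; without one it would have to keep unbounded state.
val query = counts.writeStream
  .outputMode("append")
  .format("console")
  .start()
```

The watermark lets Spark drop state for windows that can no longer receive late data, which is what makes bounded output modes legal for streaming aggregations.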

A timeout issue keeps occurring when using Spark Structured Streaming to read data from Kafka

Submitted by 我与影子孤独终老i on 2020-05-14 18:26:08
Question: Here is the code I used to read data from Kafka with Spark Structured Streaming:

```scala
// ss: SparkSession is defined before.
import ss.implicits._

val df = ss
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_server)
  .option("subscribe", topic_input)
  .option("startingOffsets", "latest")
  .option("kafkaConsumer.pollTimeoutMs", "5000")
  .option("failOnDataLoss", "false")
  .load()
```

Here is the error:

```
Caused by: java.util.concurrent.TimeoutException: Cannot fetch record xxxx for
```
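When the broker is slow to answer, one common mitigation (a sketch, reusing the question's own `kafka_server` and `topic_input` placeholders) is to raise the executor-side poll timeout and cap how much each micro-batch fetches:

```scala
val df = ss
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_server)
  .option("subscribe", topic_input)
  .option("startingOffsets", "latest")
  // Give each poll more time before TimeoutException is raised (was 5000 ms).
  .option("kafkaConsumer.pollTimeoutMs", "120000")
  // Bound the records per micro-batch so each trigger has less to fetch.
  .option("maxOffsetsPerTrigger", "10000")
  .option("failOnDataLoss", "false")
  .load()
```

Both `kafkaConsumer.pollTimeoutMs` and `maxOffsetsPerTrigger` are documented options of the Kafka source; the specific values above are illustrative, not tuned recommendations.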

Spark structured streaming app reading from multiple Kafka topics

Submitted by 你离开我真会死。 on 2020-05-13 04:53:07
Question: I have a Spark Structured Streaming app (v2.3.2) which needs to read from a number of Kafka topics, do some relatively simple processing (mainly aggregations and a few joins), and publish the results to a number of other Kafka topics. So multiple streams are processed in the same app. I was wondering whether it makes a difference from a resource point of view (memory, executors, threads, Kafka listeners, etc.) if I set up just one direct readStream which subscribes to multiple topics and then …
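The two layouts the question is comparing can be sketched as follows (broker address and topic names are hypothetical). In the first, a single source subscribes to all topics and rows are routed downstream by the Kafka source's built-in `topic` column; in the second, each topic gets its own readStream and consumer instance:

```scala
import org.apache.spark.sql.functions.col

// Option A: one readStream subscribed to several topics.
val all = ss.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topicA,topicB,topicC")   // comma-separated topic list
  .load()
// Split the unified stream using the built-in "topic" column.
val aOnly = all.filter(col("topic") === "topicA")

// Option B: one readStream per topic.
def streamFor(topic: String) = ss.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", topic)
  .load()
val a = streamFor("topicA")
val b = streamFor("topicB")
```

Either way, each `writeStream ... start()` produces its own streaming query with its own checkpoint; the choice mainly affects how many Kafka consumers and queries the app maintains.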

How to process new files in an HDFS directory once they have finished being written?

Submitted by 丶灬走出姿态 on 2020-04-11 11:41:48
Question: In my scenario, CSV files are continuously uploaded to HDFS. As soon as a new file is uploaded, I'd like to process it with Spark SQL (e.g., compute the maximum of a field in the file, or transform the file into Parquet). That is, I have a one-to-one mapping between each input file and a transformed/processed output file. I was evaluating Spark Streaming to listen to the HDFS directory and then process the "streamed file" with Spark. However, in order to process the whole file I would …
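Structured Streaming's file source only considers files that appear atomically in the watched directory, so the usual pattern is to upload into a staging path and rename into the watched directory once the write is complete. A sketch, with hypothetical paths and schema fields:

```scala
import org.apache.spark.sql.types._

// A streaming file source requires an explicit schema (fields assumed here).
val schema = new StructType()
  .add("id", LongType)
  .add("value", DoubleType)

val csvStream = ss.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", 1)     // handle one new file per micro-batch
  .csv("hdfs:///data/incoming")        // files must be moved here atomically

// Convert each picked-up file's contents to Parquet.
csvStream.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/out")
  .option("checkpointLocation", "hdfs:///data/chk")
  .start()
```

`maxFilesPerTrigger` keeps the file-to-batch mapping close to one-to-one, though the output layout is per-batch Parquet files rather than a literal per-input-file rename.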

Shut down Spark Structured Streaming gracefully

Submitted by 泪湿孤枕 on 2020-03-22 06:56:09
Question: There is a way to enable graceful shutdown of Spark Streaming by setting the property spark.streaming.stopGracefullyOnShutdown to true and then killing the process with kill -SIGTERM. However, I don't see such an option available for Structured Streaming (SQLContext.scala). Is the shutdown process different in Structured Streaming, or is it simply not implemented yet? Answer 1: This feature is not implemented yet. But the write-ahead logs of Spark Structured Streaming claim to recover state and …
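Pending built-in support, a common workaround is to stop the query yourself between batches, e.g. by polling for a shutdown marker and calling StreamingQuery.stop() only when no trigger is running. A sketch, with a hypothetical marker path and a `df` assumed to be an existing streaming DataFrame:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val query = df.writeStream.format("console").start()

val fs = FileSystem.get(ss.sparkContext.hadoopConfiguration)
val marker = new Path("/tmp/stop-my-query")   // hypothetical shutdown flag

// Poll instead of blocking on awaitTermination(): stop between batches
// once the flag file appears and the current trigger has finished.
while (query.isActive) {
  if (fs.exists(marker) && !query.status.isTriggerActive) {
    query.stop()   // stops the query without killing the JVM mid-batch
  }
  Thread.sleep(5000)
}
```

Because Structured Streaming checkpoints offsets and state per batch, a restart from the same checkpoint location resumes where the stopped query left off.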