spark-structured-streaming

Why does a streaming Dataset fail with “Complete output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets…”?

Submitted by 时光总嘲笑我的痴心妄想 on 2020-06-17 08:48:27
Question: I use Spark 2.2.0 and get the following error with Spark Structured Streaming on Windows: "Complete output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark". Answer 1: Streaming aggregations require that you tell the Spark Structured Streaming engine when to output the aggregation (per the so-called output mode), since …
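The usual remedy the answer is pointing toward is to declare a watermark on the event-time column before aggregating. A minimal sketch, assuming a streaming DataFrame `df` with an event-time column named `timestamp` (both names are hypothetical here, not from the question):

```scala
import org.apache.spark.sql.functions.{window, col}

// df is a streaming DataFrame with an event-time column "timestamp" (assumed).
val counts = df
  .withWatermark("timestamp", "10 minutes")            // tell the engine how late data may arrive
  .groupBy(window(col("timestamp"), "5 minutes"), col("word"))
  .count()

// With a watermark in place, the engine can finalize and emit windows
// incrementally; without one it would have to keep unbounded state.
val query = counts.writeStream
  .outputMode("append")
  .format("console")
  .start()
```

The watermark lets Spark drop state for windows that can no longer receive late data, which is what makes bounded output modes legal for streaming aggregations.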

A timeout issue keeps occurring when using Spark Structured Streaming to read data from Kafka

Submitted by 我与影子孤独终老i on 2020-05-14 18:26:08
Question: Here is the code I used to read data from Kafka with Spark Structured Streaming:

```scala
// ss: SparkSession is defined before.
import ss.implicits._

val df = ss
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_server)
  .option("subscribe", topic_input)
  .option("startingOffsets", "latest")
  .option("kafkaConsumer.pollTimeoutMs", "5000")
  .option("failOnDataLoss", "false")
  .load()
```

Here is the error:

```
Caused by: java.util.concurrent.TimeoutException: Cannot fetch record xxxx for
```
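When the broker is slow to answer, one common mitigation (a sketch, reusing the question's own `kafka_server` and `topic_input` placeholders) is to raise the executor-side poll timeout and cap how much each micro-batch fetches:

```scala
val df = ss
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_server)
  .option("subscribe", topic_input)
  .option("startingOffsets", "latest")
  // Give each poll more time before TimeoutException is raised (was 5000 ms).
  .option("kafkaConsumer.pollTimeoutMs", "120000")
  // Bound the records per micro-batch so each trigger has less to fetch.
  .option("maxOffsetsPerTrigger", "10000")
  .option("failOnDataLoss", "false")
  .load()
```

Both `kafkaConsumer.pollTimeoutMs` and `maxOffsetsPerTrigger` are documented options of the Kafka source; the specific values above are illustrative, not tuned recommendations.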

Spark structured streaming app reading from multiple Kafka topics

Submitted by 你离开我真会死。 on 2020-05-13 04:53:07
Question: I have a Spark Structured Streaming app (v2.3.2) which needs to read from a number of Kafka topics, do some relatively simple processing (mainly aggregations and a few joins), and publish the results to a number of other Kafka topics. So multiple streams are processed in the same app. I was wondering whether it makes a difference from a resource point of view (memory, executors, threads, Kafka listeners, etc.) if I set up just one direct readStream which subscribes to multiple topics and then …
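The two layouts the question is comparing can be sketched as follows (broker address and topic names are hypothetical). In the first, a single source subscribes to all topics and rows are routed downstream by the Kafka source's built-in `topic` column; in the second, each topic gets its own readStream and consumer instance:

```scala
import org.apache.spark.sql.functions.col

// Option A: one readStream subscribed to several topics.
val all = ss.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topicA,topicB,topicC")   // comma-separated topic list
  .load()
// Split the unified stream using the built-in "topic" column.
val aOnly = all.filter(col("topic") === "topicA")

// Option B: one readStream per topic.
def streamFor(topic: String) = ss.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", topic)
  .load()
val a = streamFor("topicA")
val b = streamFor("topicB")
```

Either way, each `writeStream ... start()` produces its own streaming query with its own checkpoint; the choice mainly affects how many Kafka consumers and queries the app maintains.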

How to process new files in an HDFS directory once they have finished being written?

Submitted by 丶灬走出姿态 on 2020-04-11 11:41:48
Question: In my scenario, CSV files are continuously uploaded to HDFS. As soon as a new file is uploaded, I'd like to process it with Spark SQL (e.g., compute the maximum of a field in the file, or transform the file into Parquet). That is, I have a one-to-one mapping between each input file and a transformed/processed output file. I was evaluating Spark Streaming to listen to the HDFS directory and then process the "streamed file" with Spark. However, in order to process the whole file I would …
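Structured Streaming's file source only considers files that appear atomically in the watched directory, so the usual pattern is to upload into a staging path and rename into the watched directory once the write is complete. A sketch, with hypothetical paths and schema fields:

```scala
import org.apache.spark.sql.types._

// A streaming file source requires an explicit schema (fields assumed here).
val schema = new StructType()
  .add("id", LongType)
  .add("value", DoubleType)

val csvStream = ss.readStream
  .schema(schema)
  .option("maxFilesPerTrigger", 1)     // handle one new file per micro-batch
  .csv("hdfs:///data/incoming")        // files must be moved here atomically

// Convert each picked-up file's contents to Parquet.
csvStream.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/out")
  .option("checkpointLocation", "hdfs:///data/chk")
  .start()
```

`maxFilesPerTrigger` keeps the file-to-batch mapping close to one-to-one, though the output layout is per-batch Parquet files rather than a literal per-input-file rename.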

Shut down Spark Structured Streaming gracefully

Submitted by 泪湿孤枕 on 2020-03-22 06:56:09
Question: There is a way to enable graceful shutdown of Spark Streaming by setting the property spark.streaming.stopGracefullyOnShutdown to true and then killing the process with kill -SIGTERM. However, I don't see such an option available for Structured Streaming (SQLContext.scala). Is the shutdown process different in Structured Streaming, or is it simply not implemented yet? Answer 1: This feature is not implemented yet. But the write-ahead logs of Spark Structured Streaming claim to recover state and …
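Pending built-in support, a common workaround is to stop the query yourself between batches, e.g. by polling for a shutdown marker and calling StreamingQuery.stop() only when no trigger is running. A sketch, with a hypothetical marker path and a `df` assumed to be an existing streaming DataFrame:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val query = df.writeStream.format("console").start()

val fs = FileSystem.get(ss.sparkContext.hadoopConfiguration)
val marker = new Path("/tmp/stop-my-query")   // hypothetical shutdown flag

// Poll instead of blocking on awaitTermination(): stop between batches
// once the flag file appears and the current trigger has finished.
while (query.isActive) {
  if (fs.exists(marker) && !query.status.isTriggerActive) {
    query.stop()   // stops the query without killing the JVM mid-batch
  }
  Thread.sleep(5000)
}
```

Because Structured Streaming checkpoints offsets and state per batch, a restart from the same checkpoint location resumes where the stopped query left off.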