spark-structured-streaming

Shutdown Spark Structured Streaming gracefully

Submitted by 情到浓时终转凉″ on 2020-03-22 06:55:56
Question: There is a way to enable graceful shutdown of Spark Streaming by setting the property spark.streaming.stopGracefullyOnShutdown to true and then killing the process with kill -SIGTERM. However, I don't see such an option available for Structured Streaming (SQLContext.scala). Is the shutdown process different in Structured Streaming, or is it simply not implemented yet?

Answer 1: This feature is not implemented yet. But the write-ahead logs of Spark Structured Streaming claim to recover state and
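
A workaround often used in practice, since the property above does not exist for Structured Streaming, is to stop the query from the driver itself once an external signal is seen, for example by polling for a marker file and calling query.stop(). A minimal Scala sketch of that idea, with a hypothetical marker path and a placeholder rate source standing in for the real query:

import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs.{FileSystem, Path}

val spark = SparkSession.builder.appName("graceful-stop-sketch").getOrCreate()

// Placeholder query; the real source and sink would go here.
val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val marker = new Path("/tmp/stop_streaming")   // hypothetical marker file

// Poll for the marker instead of relying on SIGTERM handling.
while (query.isActive) {
  if (fs.exists(marker)) {
    query.stop()                  // stops the query; how much of the in-flight batch completes depends on the Spark version
  } else {
    query.awaitTermination(10000) // wait up to 10 s, then check the marker again
  }
}
spark.stop()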

Spark Streaming: Read JSON from Kafka and add event_time

Submitted by 余生颓废 on 2020-03-16 10:03:44
Question: I am trying to write a stateful Spark Structured Streaming job that reads from Kafka. As part of the requirement I need to add an 'event_time' column to my stream. I am trying something like this:

val schema = spark.read.json("sample-data/test.json").schema
val myStream = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "myTopic")
  .load()
val df = myStream.select(from_json($"value".cast("string"), schema).alias(
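
The snippet above is cut off, but one way to obtain an event_time column is to reuse the timestamp column that the Kafka source already attaches to each record (or fall back to current_timestamp() if the message carries no usable time). A Scala sketch under that assumption, with the same placeholder topic and schema file:

import org.apache.spark.sql.functions.{col, from_json}

val schema = spark.read.json("sample-data/test.json").schema

val parsed = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "myTopic")
  .load()
  .select(
    from_json(col("value").cast("string"), schema).alias("data"),
    col("timestamp").alias("event_time")   // Kafka record timestamp exposed by the source
  )
  .select("data.*", "event_time")

// event_time can then drive watermarking for stateful operations, e.g.:
val withWatermark = parsed.withWatermark("event_time", "10 minutes")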

How to decode a byte[] of List&lt;Objects&gt; to Dataset&lt;Row&gt; in Spark?

Submitted by 放肆的年华 on 2020-03-10 04:37:49
Question: I am using spark-sql-2.3.1, Kafka, and Java 8 in my project. I am trying to convert a byte[] received from the topic into a Dataset on the Kafka consumer side. Here are the details. I have

class Company {
  String companyName;
  Integer companyId;
}

for which I defined the schema as

public static final StructType companySchema = new StructType()
  .add("companyName", DataTypes.StringType)
  .add("companyId", DataTypes.IntegerType);

But the message is defined as

class Message {
  private List<Company> companyList;
  private String messageId;
}

I
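
The question is truncated, but assuming the bytes on the topic are UTF-8 JSON for a Message object, one approach is to describe the nested list with an ArrayType over the company struct and parse the Kafka value with from_json, then explode the list into one row per Company. A Scala sketch of that idea (kafkaDf stands for the raw Dataset&lt;Row&gt; read from Kafka and is an assumption):

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

val companySchema = new StructType()
  .add("companyName", StringType)
  .add("companyId", IntegerType)

val messageSchema = new StructType()
  .add("messageId", StringType)
  .add("companyList", ArrayType(companySchema))

val companies = kafkaDf                                    // raw Kafka stream, columns key/value/...
  .select(from_json(col("value").cast("string"), messageSchema).alias("msg"))
  .select(col("msg.messageId"), explode(col("msg.companyList")).alias("company"))
  .select("messageId", "company.*")                        // one row per Company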

While writing to S3, why do I get a FileNotFoundException?

Submitted by 别等时光非礼了梦想. on 2020-03-05 00:22:42
Question: I'm using Spark-SQL-2.3.1, Kafka, and Java 8 in my project, and would like to use AWS S3 as storage. I am writing/storing the consumed data from the Kafka topic into an S3 bucket as below:

ds.writeStream()
  .format("parquet")
  .option("path", parquetFileName)
  .option("mergeSchema", true)
  .outputMode("append")
  .partitionBy("company_id")
  .option("checkpointLocation", checkPtLocation)
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start();

But while writing I am getting a FileNotFoundException: Caused
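
The stack trace is cut off above, but this symptom is commonly reported when the checkpoint location also points at S3, whose rename and listing behaviour does not match what the checkpoint and file-sink commit logic expects. A mitigation often suggested, sketched here in Scala as an assumption rather than a confirmed fix, is to keep the checkpoint on HDFS (or another real filesystem) while the parquet output goes to s3a; the bucket, paths and ds are placeholders:

import org.apache.spark.sql.streaming.Trigger

ds.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/parquet-out")                 // hypothetical bucket/prefix
  .option("checkpointLocation", "hdfs:///checkpoints/my-query")  // keep checkpoints off S3
  .partitionBy("company_id")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()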

Spark context keeps stopping when trying to start a stream that is subscribed to a CloudKarafka instance

Submitted by 穿精又带淫゛_ on 2020-03-04 23:08:09
Question: I'm trying to subscribe to a Kafka topic that's hosted in the cloud (CloudKarafka). I want to write my stream to the console to test whether I'm consuming the messages. However, when I start my writeStream it just keeps stopping my SparkContext. I'm not sure if the connection is my problem or my code is the problem. I have consumed from this topic before with Apache Flink and then it was working fine. One thing I noticed is that when I was connecting with Flink instead of Spark I would use the
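
The sentence is cut off, but CloudKarafka instances normally require SASL/SCRAM authentication, which a Flink consumer would receive as plain Kafka properties; with the Spark Kafka source the same client settings have to be passed with the kafka. prefix, and a missing or unprefixed setting could explain the query failing right after start. A Scala sketch under that assumption, with placeholder brokers, topic and credentials:

val jaas = """org.apache.kafka.common.security.scram.ScramLoginModule required username="myUser" password="myPassword";"""   // placeholder credentials

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9094,host2:9094")   // placeholder brokers
  .option("subscribe", "myUser-myTopic")                        // placeholder topic
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "SCRAM-SHA-256")
  .option("kafka.sasl.jaas.config", jaas)
  .load()

stream.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()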

Unable to set Kafka Spark consumer configs

Submitted by 久未见 on 2020-02-16 06:47:19
Question: I am using spark-sql-2.4.x with the Kafka client. Even after setting the consumer configuration parameters max.partition.fetch.bytes and max.poll.records, they are not being applied and the default values are shown, as below:

Dataset<Row> df = sparkSession
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", server1)
  .option("subscribe", TOPIC1)
  .option("includeTimestamp", true)
  .option("startingOffsets", "latest")
  .option("max.partition.fetch.bytes", "2097152") // default 1000
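
The snippet is cut off, but the usual explanation for this symptom is that the Spark Kafka source only forwards options carrying the kafka. prefix to the underlying consumer, so max.partition.fetch.bytes written without the prefix is silently ignored. A Scala sketch of the prefixed form (whether Spark lets the consumer keep every such setting can differ between versions, so treat this as an assumption to verify):

val df = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", server1)
  .option("subscribe", TOPIC1)
  .option("includeTimestamp", true)
  .option("startingOffsets", "latest")
  .option("kafka.max.partition.fetch.bytes", "2097152")   // note the kafka. prefix
  .option("kafka.max.poll.records", "100")                // placeholder value
  .load()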

How to insert data into Hive using the foreach method in Spark Structured Streaming

Submitted by 旧城冷巷雨未停 on 2020-02-06 10:00:25
Question: I am trying to insert data into a Hive table using the foreach method. I am using Spark 2.3.0. Here is my code:

df_drop_window.writeStream
  .foreach(new ForeachWriter[Row]() {
    override def open(partitionId: Long, epochId: Long): Boolean = true
    override def process(value: Row): Unit = {
      println(s">> Processing ${value}")
      // how to convert the value to a dataframe?
    }
    override def close(errorOrNull: Throwable): Unit = { }
  })
  .outputMode("update")
  .start()

As you can see above, I want to convert the "value" to a dataframe and
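
A ForeachWriter only ever sees one Row at a time inside process(), so there is no DataFrame to recover there. From Spark 2.4 onwards the usual route is foreachBatch, which hands over a full DataFrame per micro-batch that can be written to a Hive table; on 2.3.0 the closest options are upgrading or writing files to a location that a Hive external table points at. A Scala sketch of the foreachBatch variant, assuming Spark 2.4+, enableHiveSupport() on the session, and a hypothetical table my_db.my_table:

import org.apache.spark.sql.DataFrame

df_drop_window.writeStream
  .outputMode("update")
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Each micro-batch arrives as a regular DataFrame and can use the batch writer.
    batchDf.write
      .mode("append")
      .insertInto("my_db.my_table")   // hypothetical pre-existing Hive table
  }
  .start()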

PySpark Structured Streaming processing

Submitted by  ̄綄美尐妖づ on 2020-01-30 11:50:47
Question: I am trying to build a Structured Streaming application with Spark. The main idea is to read from a Kafka source, process the input, and write back to another topic. I have successfully made Spark read from and write to Kafka; however, my problem is with the processing part. I have tried the foreach function to capture every row and process it before writing back to Kafka, but it always only does the foreach part and never writes back to Kafka. If I however remove the foreach part from the
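
The description is truncated, but a pattern that avoids the foreach sink entirely is to express the processing as DataFrame transformations and let the built-in Kafka sink write the result to the output topic; println calls inside foreach go to the executor logs and never reach Kafka. A Scala sketch of that shape (the same structure applies in PySpark), with placeholder topics, a placeholder checkpoint path, and a trivial stand-in transformation:

import org.apache.spark.sql.functions.{col, upper}

val processed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input-topic")                      // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .withColumn("value", upper(col("value")))                // stand-in for the real processing

processed
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output-topic")                                   // placeholder topic
  .option("checkpointLocation", "/tmp/checkpoints/kafka-roundtrip")  // placeholder path
  .start()
  .awaitTermination()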