spark-structured-streaming

Shutdown Spark Structured Streaming gracefully

Submitted by 情到浓时终转凉″ on 2020-03-22 06:55:56
Question: There is a way to enable graceful shutdown of Spark Streaming by setting the property spark.streaming.stopGracefullyOnShutdown to true and then killing the process with kill -SIGTERM. However, I don't see such an option available for Structured Streaming (SQLContext.scala). Is the shutdown process different in Structured Streaming, or is it simply not implemented yet?

Answer 1: This feature is not implemented yet. But the write-ahead logs of Spark Structured Streaming claim to recover state and
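
A workaround often used in practice, since the property above does not exist for Structured Streaming, is to stop the query from the driver itself once an external signal is seen, for example by polling for a marker file and calling query.stop(). A minimal Scala sketch of that idea, with a hypothetical marker path and a placeholder rate source standing in for the real query:

import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs.{FileSystem, Path}

val spark = SparkSession.builder.appName("graceful-stop-sketch").getOrCreate()

// Placeholder query; the real source and sink would go here.
val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val marker = new Path("/tmp/stop_streaming")   // hypothetical marker file

// Poll for the marker instead of relying on SIGTERM handling.
while (query.isActive) {
  if (fs.exists(marker)) {
    query.stop()                  // stops the query; how much of the in-flight batch completes depends on the Spark version
  } else {
    query.awaitTermination(10000) // wait up to 10 s, then check the marker again
  }
}
spark.stop()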

Spark Streaming: Read JSON from Kafka and add event_time

Submitted by 余生颓废 on 2020-03-16 10:03:44
Question: I am trying to write a stateful Spark Structured Streaming job that reads from Kafka. As part of the requirement I need to add an 'event_time' column to my stream. I am trying something like this:

val schema = spark.read.json("sample-data/test.json").schema
val myStream = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "myTopic")
  .load()
val df = myStream.select(from_json($"value".cast("string"), schema).alias(
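
The snippet above is cut off, but one way to obtain an event_time column is to reuse the timestamp column that the Kafka source already attaches to each record (or fall back to current_timestamp() if the message carries no usable time). A Scala sketch under that assumption, with the same placeholder topic and schema file:

import org.apache.spark.sql.functions.{col, from_json}

val schema = spark.read.json("sample-data/test.json").schema

val parsed = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "myTopic")
  .load()
  .select(
    from_json(col("value").cast("string"), schema).alias("data"),
    col("timestamp").alias("event_time")   // Kafka record timestamp exposed by the source
  )
  .select("data.*", "event_time")

// event_time can then drive watermarking for stateful operations, e.g.:
val withWatermark = parsed.withWatermark("event_time", "10 minutes")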

How to decode a byte[] of List&lt;Objects&gt; to Dataset&lt;Row&gt; in Spark?

Submitted by 放肆的年华 on 2020-03-10 04:37:49
Question: I am using spark-sql-2.3.1, Kafka, and Java 8 in my project. I am trying to convert a byte[] received from the topic into a Dataset on the Kafka consumer side. Here are the details. I have

class Company {
  String companyName;
  Integer companyId;
}

for which I defined the schema as

public static final StructType companySchema = new StructType()
  .add("companyName", DataTypes.StringType)
  .add("companyId", DataTypes.IntegerType);

But the message is defined as

class Message {
  private List<Company> companyList;
  private String messageId;
}

I
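
The question is truncated, but assuming the bytes on the topic are UTF-8 JSON for a Message object, one approach is to describe the nested list with an ArrayType over the company struct and parse the Kafka value with from_json, then explode the list into one row per Company. A Scala sketch of that idea (kafkaDf stands for the raw Dataset&lt;Row&gt; read from Kafka and is an assumption):

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

val companySchema = new StructType()
  .add("companyName", StringType)
  .add("companyId", IntegerType)

val messageSchema = new StructType()
  .add("messageId", StringType)
  .add("companyList", ArrayType(companySchema))

val companies = kafkaDf                                    // raw Kafka stream, columns key/value/...
  .select(from_json(col("value").cast("string"), messageSchema).alias("msg"))
  .select(col("msg.messageId"), explode(col("msg.companyList")).alias("company"))
  .select("messageId", "company.*")                        // one row per Company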

While writing to S3, why do I get a FileNotFoundException?

Submitted by 别等时光非礼了梦想. on 2020-03-05 00:22:42
Question: I'm using Spark-SQL-2.3.1, Kafka, and Java 8 in my project, and would like to use AWS S3 as storage. I am writing/storing the consumed data from the Kafka topic into an S3 bucket as below:

ds.writeStream()
  .format("parquet")
  .option("path", parquetFileName)
  .option("mergeSchema", true)
  .outputMode("append")
  .partitionBy("company_id")
  .option("checkpointLocation", checkPtLocation)
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start();

But while writing I am getting a FileNotFoundException: Caused
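
The stack trace is cut off above, but this symptom is commonly reported when the checkpoint location also points at S3, whose rename and listing behaviour does not match what the checkpoint and file-sink commit logic expects. A mitigation often suggested, sketched here in Scala as an assumption rather than a confirmed fix, is to keep the checkpoint on HDFS (or another real filesystem) while the parquet output goes to s3a; the bucket, paths and ds are placeholders:

import org.apache.spark.sql.streaming.Trigger

ds.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/parquet-out")                 // hypothetical bucket/prefix
  .option("checkpointLocation", "hdfs:///checkpoints/my-query")  // keep checkpoints off S3
  .partitionBy("company_id")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()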

Spark context keeps stopping when trying to start a stream that is subscribed to a CloudKarafka instance

Submitted by 穿精又带淫゛_ on 2020-03-04 23:08:09
Question: I'm trying to subscribe to a Kafka topic that's hosted in the cloud (CloudKarafka). I want to write my stream to the console to test whether I'm consuming the messages. However, when I start my writeStream it just keeps stopping my SparkContext. I'm not sure if the connection is my problem or my code is the problem. I have consumed from this topic before with Apache Flink and then it was working fine. One thing I noticed is that when I was connecting with Flink instead of Spark I would use the
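
The sentence is cut off, but CloudKarafka instances normally require SASL/SCRAM authentication, which a Flink consumer would receive as plain Kafka properties; with the Spark Kafka source the same client settings have to be passed with the kafka. prefix, and a missing or unprefixed setting could explain the query failing right after start. A Scala sketch under that assumption, with placeholder brokers, topic and credentials:

val jaas = """org.apache.kafka.common.security.scram.ScramLoginModule required username="myUser" password="myPassword";"""   // placeholder credentials

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9094,host2:9094")   // placeholder brokers
  .option("subscribe", "myUser-myTopic")                        // placeholder topic
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.mechanism", "SCRAM-SHA-256")
  .option("kafka.sasl.jaas.config", jaas)
  .load()

stream.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()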

Unable to set Kafka Spark consumer configs

Submitted by 久未见 on 2020-02-16 06:47:19
Question: I am using spark-sql-2.4.x with the Kafka client. Even after setting the consumer configuration parameters max.partition.fetch.bytes and max.poll.records, they are not being applied and the default values are shown, as below:

Dataset<Row> df = sparkSession
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", server1)
  .option("subscribe", TOPIC1)
  .option("includeTimestamp", true)
  .option("startingOffsets", "latest")
  .option("max.partition.fetch.bytes", "2097152") // default 1000
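
The snippet is cut off, but the usual explanation for this symptom is that the Spark Kafka source only forwards options carrying the kafka. prefix to the underlying consumer, so max.partition.fetch.bytes written without the prefix is silently ignored. A Scala sketch of the prefixed form (whether Spark lets the consumer keep every such setting can differ between versions, so treat this as an assumption to verify):

val df = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", server1)
  .option("subscribe", TOPIC1)
  .option("includeTimestamp", true)
  .option("startingOffsets", "latest")
  .option("kafka.max.partition.fetch.bytes", "2097152")   // note the kafka. prefix
  .option("kafka.max.poll.records", "100")                // placeholder value
  .load()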

How to insert data into Hive using the foreach method in Spark Structured Streaming

Submitted by 旧城冷巷雨未停 on 2020-02-06 10:00:25
Question: I am trying to insert data into a Hive table using the foreach method. I am using Spark 2.3.0. Here is my code:

df_drop_window.writeStream
  .foreach(new ForeachWriter[Row]() {
    override def open(partitionId: Long, epochId: Long): Boolean = true
    override def process(value: Row): Unit = {
      println(s">> Processing ${value}")
      // how to convert the value to a dataframe?
    }
    override def close(errorOrNull: Throwable): Unit = { }
  })
  .outputMode("update")
  .start()

As you can see above, I want to convert the "value" to a dataframe and
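
A ForeachWriter only ever sees one Row at a time inside process(), so there is no DataFrame to recover there. From Spark 2.4 onwards the usual route is foreachBatch, which hands over a full DataFrame per micro-batch that can be written to a Hive table; on 2.3.0 the closest options are upgrading or writing files to a location that a Hive external table points at. A Scala sketch of the foreachBatch variant, assuming Spark 2.4+, enableHiveSupport() on the session, and a hypothetical table my_db.my_table:

import org.apache.spark.sql.DataFrame

df_drop_window.writeStream
  .outputMode("update")
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Each micro-batch arrives as a regular DataFrame and can use the batch writer.
    batchDf.write
      .mode("append")
      .insertInto("my_db.my_table")   // hypothetical pre-existing Hive table
  }
  .start()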

PySpark Structured Streaming processing

Submitted by  ̄綄美尐妖づ on 2020-01-30 11:50:47
Question: I am trying to build a Structured Streaming application with Spark. The main idea is to read from a Kafka source, process the input, and write back to another topic. I have successfully made Spark read from and write to Kafka; however, my problem is with the processing part. I have tried the foreach function to capture every row and process it before writing back to Kafka, but it always only does the foreach part and never writes back to Kafka. If I however remove the foreach part from the
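
The description is truncated, but a pattern that avoids the foreach sink entirely is to express the processing as DataFrame transformations and let the built-in Kafka sink write the result to the output topic; println calls inside foreach go to the executor logs and never reach Kafka. A Scala sketch of that shape (the same structure applies in PySpark), with placeholder topics, a placeholder checkpoint path, and a trivial stand-in transformation:

import org.apache.spark.sql.functions.{col, upper}

val processed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input-topic")                      // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .withColumn("value", upper(col("value")))                // stand-in for the real processing

processed
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output-topic")                                   // placeholder topic
  .option("checkpointLocation", "/tmp/checkpoints/kafka-roundtrip")  // placeholder path
  .start()
  .awaitTermination()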