spark-structured-streaming

How to use foreach or foreachBatch in PySpark to write to database?

跟風遠走 submitted on 2020-08-25 07:04:12
Question: I want to do Spark Structured Streaming (Spark 2.4.x) from a Kafka source to MariaDB with Python (PySpark). I want to use the streaming Spark DataFrame, not a static or Pandas DataFrame. It seems that one has to use foreach or foreachBatch, since there is no built-in database sink for streaming DataFrames according to https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks. Here is my try: from pyspark.sql import SparkSession import pyspark.sql
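A minimal foreachBatch sketch of the approach the question asks about; the broker, topic, JDBC URL, table name and credentials below are placeholders, not values from the question. The key point is that each micro-batch arrives in the callback as an ordinary static DataFrame, so the regular JDBC batch writer can be used there.

    # Sketch only: placeholder broker, topic, JDBC URL, table and credentials.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-mariadb").getOrCreate()

    stream_df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
        .option("subscribe", "my_topic")                        # placeholder topic
        .load()
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)"))

    def write_to_mariadb(batch_df, batch_id):
        # Each micro-batch is a static DataFrame, so the JDBC batch sink works here.
        (batch_df.write
            .format("jdbc")
            .option("url", "jdbc:mysql://localhost:3306/mydb")  # placeholder URL
            .option("driver", "org.mariadb.jdbc.Driver")
            .option("dbtable", "my_table")                      # placeholder table
            .option("user", "user")                             # placeholder credentials
            .option("password", "password")
            .mode("append")
            .save())

    query = (stream_df.writeStream
        .foreachBatch(write_to_mariadb)
        .option("checkpointLocation", "/tmp/checkpoints/mariadb")  # placeholder path
        .start())
    query.awaitTermination()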

How to update a Static Dataframe with Streaming Dataframe in Spark structured streaming

陌路散爱 submitted on 2020-08-24 10:33:59
Question: I have a static DataFrame with millions of rows as follows.

Static DataFrame:
--------------
id|time_stamp|
--------------
|1|1540527851|
|2|1540525602|
|3|1530529187|
|4|1520529185|
|5|1510529182|
|6|1578945709|
--------------

Now in every batch a streaming DataFrame is formed, which contains id and an updated time_stamp after some operations, like below.

In the first batch:
--------------
id|time_stamp|
--------------
|1|1540527888|
|2|1540525999|
|3|1530529784|
--------------

Now in every
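One commonly suggested pattern, sketched below with an assumed placeholder source path, broker, topic and schema: use foreachBatch to fold each micro-batch into the driver-side static DataFrame, keeping the newest time_stamp per id. It is only a sketch; a real job would also unpersist superseded copies and persist the merged result somewhere durable.

    # Sketch only: placeholder paths, broker, topic and (id, time_stamp) schema.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("update-static-df").getOrCreate()

    static_df = spark.read.parquet("/data/static_table")  # placeholder static source

    stream_df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
        .option("subscribe", "updates")                        # placeholder topic
        .load()
        .selectExpr("CAST(CAST(key AS STRING) AS LONG) AS id",
                    "CAST(CAST(value AS STRING) AS LONG) AS time_stamp"))

    def merge_batch(batch_df, batch_id):
        # Fold the micro-batch into the driver-side static DataFrame,
        # keeping the newest time_stamp for each id.
        global static_df
        static_df = (static_df.unionByName(batch_df)
            .groupBy("id")
            .agg(F.max("time_stamp").alias("time_stamp")))
        static_df.persist()  # keep the refreshed copy available for later batches

    query = (stream_df.writeStream
        .foreachBatch(merge_batch)
        .option("checkpointLocation", "/tmp/checkpoints/static-merge")  # placeholder
        .start())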

How to manually set group.id and commit kafka offsets in spark structured streaming?

♀尐吖头ヾ submitted on 2020-08-24 06:29:12
Question: I was going through the Spark Structured Streaming - Kafka integration guide here. The guide says about enable.auto.commit: Kafka source doesn't commit any offset. So how do I manually commit offsets once my Spark application has successfully processed each record? Answer 1: Current situation (Spark 2.4.5): This feature seems to be under discussion in the Spark community, see https://github.com/apache/spark/pull/24613. In that pull request you will also find a possible solution for this at https
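For context: Structured Streaming tracks its own progress in the checkpoint directory rather than committing offsets back to Kafka, so recovery normally relies on checkpointLocation instead of Kafka commits (which also means Kafka consumer-group lag tools will not see the progress; committing back to Kafka needs extra work such as the approaches discussed in the linked pull request). A minimal sketch of the built-in mechanism, with placeholder broker, topic and paths:

    # Sketch only: Spark stores consumed offsets in the checkpoint, not in Kafka.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-offsets-via-checkpoint").getOrCreate()

    df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
        .option("subscribe", "my_topic")                        # placeholder topic
        .option("startingOffsets", "earliest")  # only used when no checkpoint exists yet
        .load())

    query = (df.writeStream
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/my_topic")  # offsets tracked here
        .start())
    query.awaitTermination()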

Is there a way to dynamically stop Spark Structured Streaming?

丶灬走出姿态 submitted on 2020-08-19 04:37:22
Question: In my scenario I have several datasets that arrive every now and then and that I need to ingest into our platform. The ingestion process involves several transformation steps, one of them being Spark; in particular, I use Spark Structured Streaming so far. The infrastructure also involves Kafka, from which Spark Structured Streaming reads data. I wonder if there is a way to detect when there is nothing left to consume from a topic for a while, in order to decide to stop the job. That is, I want to run it for the
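One way to approach this, sketched below with arbitrary placeholder thresholds, is to poll the running StreamingQuery from the driver and stop it once several consecutive progress reports show no input rows; `query` is assumed to be an already-started query.

    # Sketch only: idle threshold and poll interval are arbitrary placeholders;
    # `query` is an already-started StreamingQuery (e.g. from df.writeStream...start()).
    import time

    idle_batches = 0
    MAX_IDLE_BATCHES = 6   # stop after ~6 consecutive empty progress reports
    POLL_SECONDS = 10

    while query.isActive:
        time.sleep(POLL_SECONDS)
        progress = query.lastProgress      # most recent micro-batch progress, or None
        if progress is None or progress["numInputRows"] == 0:
            idle_batches += 1
        else:
            idle_batches = 0
        if idle_batches >= MAX_IDLE_BATCHES:
            query.stop()                   # stop once the topic looks drained

    query.awaitTermination()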

Spark structured streaming - Filter Phoenix table by streaming dataset

青春壹個敷衍的年華 submitted on 2020-08-10 19:04:06
Question: I am building a Spark Structured Streaming job that does the following.

Streaming source:

val small_df = spark.readStream
  .format("kafka")
  .load()
small_df.createOrReplaceTempView("small_df")

A DataFrame - Phoenix load:

val phoenixDF = spark.read.format("org.apache.phoenix.spark")
  .option("table", "my_table")
  .option("zkUrl", "zk")
  .load()
phoenixDF.createOrReplaceTempView("phoenix_tbl")

Then a Spark SQL statement to join (on primary_key) with another small dataframe to filter records. val
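For reference, a Python sketch of the stream-static join described above (the question itself uses Scala); the broker, topic, Phoenix table, zkUrl and join column are placeholders modelled on the question, not verified values.

    # Sketch only: placeholder broker, topic, Phoenix table, zkUrl and join column.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("phoenix-stream-filter").getOrCreate()

    small_df = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
        .option("subscribe", "keys_topic")                      # placeholder topic
        .load()
        .selectExpr("CAST(key AS STRING) AS primary_key"))

    phoenix_df = (spark.read
        .format("org.apache.phoenix.spark")
        .option("table", "my_table")   # values taken from the question
        .option("zkUrl", "zk")
        .load())

    # Stream-static inner join: each micro-batch keeps only the Phoenix rows
    # whose primary_key appears in that batch of the stream.
    filtered = small_df.join(phoenix_df, on="primary_key", how="inner")

    query = (filtered.writeStream
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/phoenix-filter")  # placeholder
        .start())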

PySpark structured Streaming + Kafka Error (Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamWriteSupport )

一世执手 submitted on 2020-08-06 05:16:09
Question: I am trying to run Python Spark Structured Streaming + Kafka. When I run the command

Master@MacBook-Pro spark-3.0.0-preview2-bin-hadoop2.7 % bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.5 \
  examples/src/main/python/sql/streaming/structured_kafka_wordcount.py \
  /Users/Master/Projects/bank_kafka_spark/spark_job1.py localhost:9092 transaction

I keep receiving the following:

20/04/22 13:06:04 WARN Utils: Your hostname, MacBook-Pro.local resolves to a loopback address: 127.0.0.1;
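The 2.4.5 Kafka connector is built against the DataSource V2 classes of Spark 2.4 (including org.apache.spark.sql.sources.v2.StreamWriteSupport), which no longer exist in Spark 3.0.0-preview2, so mixing the two is what produces this ClassNotFoundException. Also note that spark-submit takes a single application script; any further arguments are passed to that script. A hedged sketch of a corrected invocation, assuming a 3.0.0-preview2 connector artifact matching this Spark build is what is needed:

    # Sketch only: match the Kafka connector version to the Spark build,
    # and pass just one application script to spark-submit.
    bin/spark-submit \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-preview2 \
      /Users/Master/Projects/bank_kafka_spark/spark_job1.py localhost:9092 transaction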

Why does a single structured query run multiple SQL queries per batch?

余生颓废 submitted on 2020-08-02 06:34:12
Question: Why does the following structured query run multiple SQL queries, as can be seen in the web UI's SQL tab?

import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._

val rates = spark.
  readStream.
  format("rate").
  option("numPartitions", 1).
  load.
  writeStream.
  format("console").
  option("truncate", false).
  option("numRows", 10).
  trigger(Trigger.ProcessingTime(10.seconds)).
  queryName("rate-console").
  start

Source: https://stackoverflow.com/questions/46162143/why-does-a
