How to update a Static Dataframe with Streaming Dataframe in Spark structured streaming

Submitted by 陌路散爱 on 2020-08-24 10:33:59

Question


I have a Static DataFrame with millions of rows as follows.

Static DataFrame:

---------------
|id|time_stamp|
---------------
| 1|1540527851|
| 2|1540525602|
| 3|1530529187|
| 4|1520529185|
| 5|1510529182|
| 6|1578945709|
---------------

Now, in every batch, a Streaming DataFrame is formed that contains the id and an updated time_stamp after some operations, as below.

In the first batch:

---------------
|id|time_stamp|
---------------
| 1|1540527888|
| 2|1540525999|
| 3|1530529784|
---------------

Now, in every batch, I want to update the Static DataFrame with the updated values from the Streaming DataFrame, as follows. How can I do that?

Static DataFrame after the first batch:

---------------
|id|time_stamp|
---------------
| 1|1540527888|
| 2|1540525999|
| 3|1530529784|
| 4|1520529185|
| 5|1510529182|
| 6|1578945709|
---------------

I've already tried except(), union() and a 'left_anti' join, but it seems Structured Streaming doesn't support such operations.


Answer 1:


So I resolved this issue using the foreachBatch method added in Spark 2.4.0, which converts the streaming DataFrame into mini-batch DataFrames. For versions below 2.4.0 it is still a headache.
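
For reference, here is a minimal PySpark sketch of that foreachBatch approach, assuming a synthetic rate source and illustrative data shaped like the question's (id, time_stamp) table; the names static_df and upsert_batch and the example values are not from the original post.

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("static-df-upsert").getOrCreate()

# Static DataFrame holding the current state (illustrative data).
static_df = spark.createDataFrame(
    [(1, 1540527851), (2, 1540525602), (3, 1530529187)], ["id", "time_stamp"])

def upsert_batch(batch_df: DataFrame, batch_id: int) -> None:
    global static_df
    # Rows from the batch win: keep only the static rows whose id is absent
    # from the batch, then append the fresh batch rows.
    static_df = batch_df.union(static_df.join(batch_df, ["id"], "left_anti"))

# Synthetic streaming source reshaped to the question's (id, time_stamp) schema.
streaming_df = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
                .selectExpr("value % 6 + 1 AS id",
                            "unix_timestamp(timestamp) AS time_stamp"))

query = (streaming_df.writeStream
         .foreachBatch(upsert_batch)
         .start())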




Answer 2:


I have a similar issue. Below is the foreachBatch code that I have applied to update the static DataFrame. I would like to know how to return the updated DataFrame that is produced inside foreachBatch.

from pyspark.sql import DataFrame
from pyspark.sql.streaming import StreamingQuery

def update_reference_df(df: DataFrame, static_df: DataFrame) -> StreamingQuery:
    # foreachBatch is the sink here, so no separate format() is needed.
    query: StreamingQuery = df \
        .writeStream \
        .outputMode("append") \
        .foreachBatch(lambda batch_df, batch_id: update_static_df(batch_df, static_df)) \
        .start()
    return query

def update_static_df(batch_df: DataFrame, static_df: DataFrame) -> DataFrame:
    # Keep every existing static row and append the batch rows whose SITE
    # is not yet present in the static DataFrame.
    df1: DataFrame = static_df.union(batch_df.join(static_df,
                                                   (batch_df.SITE == static_df.SITE),
                                                   "left_anti"))
    return df1
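
One way to make the merged result visible outside the callback, mirroring the reassignment pattern shown in the answer below, is to keep the reference DataFrame in a small mutable holder that foreachBatch overwrites on every micro-batch; the ReferenceState class and merge_batch function here are an illustrative sketch, not part of the original answer.

from pyspark.sql import DataFrame

class ReferenceState:
    # Mutable holder so the merged DataFrame survives each foreachBatch call.
    def __init__(self, initial_df: DataFrame):
        self.df = initial_df

def merge_batch(batch_df: DataFrame, state: ReferenceState) -> None:
    # Overwrite the held DataFrame with the merged result of this micro-batch.
    state.df = update_static_df(batch_df, state.df)

# state = ReferenceState(static_df)
# query = (df.writeStream
#            .foreachBatch(lambda batch_df, batch_id: merge_batch(batch_df, state))
#            .start())
# After each micro-batch, state.df refers to the latest merged DataFrame.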



Answer 3:


As Swarup himself already explained, you can use the foreachBatch output sink if you use Spark 2.4.x.

The sink takes a function (batchDF: DataFrame, batchId: Long) => Unit, where batchDF is the currently processed micro-batch of the streaming DataFrame and can be used as a static DataFrame. Inside this function you can therefore update the other DataFrame with the values of each batch.

See the sample below. Assume you have a DataFrame named frameToBeUpdated with the same schema, held for example as an instance variable, and you want to keep your state there:

df
  .writeStream
  .outputMode("append")
  .foreachBatch((batch: DataFrame, batchId: Long) => {
    // batch is a static DataFrame

    // take all rows from the original frame that aren't in the batch,
    // union them with the batch, then reassign the result to the
    // DataFrame you want to keep
    frameToBeUpdated = batch.union(frameToBeUpdated.join(batch, Seq("id"), "left_anti"))
  })
  .start()

The updating logic comes from: spark: merge two dataframes, if ID duplicated in two dataframes, the row in df1 overwrites the row in df2.



Source: https://stackoverflow.com/questions/53004818/how-to-update-a-static-dataframe-with-streaming-dataframe-in-spark-structured-st
