Question
I have a static DataFrame with millions of rows, as follows.
Static DataFrame:

---------------
|id|time_stamp|
---------------
| 1|1540527851|
| 2|1540525602|
| 3|1530529187|
| 4|1520529185|
| 5|1510529182|
| 6|1578945709|
---------------
Now in every batch, a streaming DataFrame is being formed which contains the id and an updated time_stamp after some operations, like below.
In the first batch:
---------------
|id|time_stamp|
---------------
| 1|1540527888|
| 2|1540525999|
| 3|1530529784|
---------------
Now in every batch, I want to update the static DataFrame with the updated values from the streaming DataFrame, as follows. How can I do that?
Static DF after the first batch:
---------------
|id|time_stamp|
---------------
| 1|1540527888|
| 2|1540525999|
| 3|1530529784|
| 4|1520529185|
| 5|1510529182|
| 6|1578945709|
---------------
I've already tried except(), union(), and a 'left_anti' join, but it seems Structured Streaming doesn't support such operations.
Answer 1:
So I resolved this issue with the foreachBatch method added in Spark 2.4.0, which converts the streaming DataFrame into mini-batch DataFrames. But for versions below 2.4.0 it's still a headache.
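For illustration, here is a minimal PySpark sketch of that approach. It is not the author's exact code: the rate source stands in for the real stream, and merging on a single id column is an assumption taken from the tables above.

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.appName("update-static-df").getOrCreate()

# The static DataFrame holding the current state (data from the question).
static_df = spark.createDataFrame(
    [(1, 1540527851), (2, 1540525602), (3, 1530529187),
     (4, 1520529185), (5, 1510529182), (6, 1578945709)],
    ["id", "time_stamp"])

def merge_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Keep every row of the batch, plus the static rows whose id does
    # not appear in the batch, so the batch overwrites the static rows.
    global static_df
    static_df = batch_df.union(
        static_df.join(batch_df, ["id"], "left_anti"))

# A rate source standing in for the real stream; a real job would read
# from Kafka, files, etc. and derive (id, time_stamp) updates from it.
updates = (spark.readStream.format("rate").option("rowsPerSecond", 1).load()
           .selectExpr("value % 6 + 1 AS id",
                       "unix_timestamp(timestamp) AS time_stamp"))

query = (updates.writeStream
         .outputMode("append")
         .foreachBatch(merge_batch)  # runs once per micro-batch
         .start())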
Answer 2:
I have a similar issue. Below is the foreachBatch that I have applied for updating the static dataframe. I would like to know how to return the updated df that is produced inside foreachBatch.
from pyspark.sql import DataFrame
from pyspark.sql.streaming import StreamingQuery

def update_reference_df(df, static_df):
    # foreachBatch is itself the sink, so no format() call is needed
    query: StreamingQuery = df \
        .writeStream \
        .outputMode("append") \
        .foreachBatch(lambda batch_df, batchId: update_static_df(batch_df, static_df)) \
        .start()
    return query

def update_static_df(batch_df, static_df):
    # all static rows, plus the batch rows whose SITE is not yet in static_df
    df1: DataFrame = static_df.union(
        batch_df.join(static_df,
                      batch_df.SITE == static_df.SITE,
                      "left_anti"))
    return df1
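Note that foreachBatch discards the return value of the function it is given, so the updated frame cannot be returned through the query itself; it has to be captured by side effect, for example by reassigning a module-level variable as in the sketch above, or a var as in the next answer.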
Answer 3:
As already explained by Swarup himself, you can use the foreachBatch output sink if you use Spark 2.4.x.
The sink takes a function (batchDF: DataFrame, batchId: Long) => Unit, where batchDF is the currently processed batch of the streaming dataframe, and it can be used as a static DataFrame.
So in this function you can update the other dataframe with the values of each batch.
See the sample below. Assume you have a dataframe named frameToBeUpdated with the same schema, held for example as an instance variable, and you want to keep your state there:
df
  .writeStream
  .outputMode("append")
  .foreachBatch((batch: DataFrame, batchId: Long) => {
    // batch is a static dataframe
    // take all rows from the original frame that aren't in the batch,
    // union them with the batch, then reassign the result to the
    // dataframe you want to keep
    frameToBeUpdated = batch.union(frameToBeUpdated.join(batch, Seq("id"), "left_anti"))
  })
  .start()
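Note that frameToBeUpdated must be declared as a var for the reassignment to compile, and since every batch adds another union/join layer to its lineage, a long-running query may want to checkpoint or re-cache the frame periodically.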
The updating logic comes from: spark: merge two dataframes, if ID duplicated in two dataframes, the row in df1 overwrites the row in df2
Source: https://stackoverflow.com/questions/53004818/how-to-update-a-static-dataframe-with-streaming-dataframe-in-spark-structured-st