How to deduplicate and keep latest based on timestamp field in spark structured streaming?

Posted by 天大地大妈咪最大 on 2021-02-08 08:44:17

Question


Spark dropDuplicates keeps the first instance and ignores all subsequent occurrences for that key. Is it possible to remove duplicates while keeping the most recent occurrence?

For example, if the micro-batches below are what I receive, then I want to keep the most recent record (based on the timestamp field) for each country.

batchId: 0

Australia, 10, 2020-05-05 00:00:06
Belarus, 10, 2020-05-05 00:00:06

batchId: 1

Australia, 10, 2020-05-05 00:00:08
Belarus, 10, 2020-05-05 00:00:03

Then the output after batchId 1 should be:

Australia, 10, 2020-05-05 00:00:08
Belarus, 10, 2020-05-05 00:00:06

Update-1: This is the code I currently have

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// kafkaDF is a streaming DataFrame created with Kafka as the source
val streamingDF = kafkaDF.dropDuplicates("country")

streamingDF.writeStream
    .trigger(Trigger.ProcessingTime(10000L))
    .outputMode("update")
    .foreachBatch {
      (batchDF: DataFrame, batchId: Long) => {
        println("batchId: " + batchId)
        batchDF.show()
      }
    }.start()

I want to output all rows which are either new or have a greater timestamp than any record for that country seen in the batches processed so far. Example below:

After batchId: 0 - Both countries appeared for the first time, so I should get them in the output:

Australia, 10, 2020-05-05 00:00:06
Belarus, 10, 2020-05-05 00:00:06

After batchId: 1 - Belarus's timestamp is older than the one I received in batch 0, so I don't display it in the output. Australia is displayed because its timestamp is more recent than what I have seen so far.

Australia, 10, 2020-05-05 00:00:08

Now let's say batchId 2 arrives with both records as late arrivals; then it should not display anything in the output for that batch.

Input batchId: 2

Australia, 10, 2020-05-05 00:00:01
Belarus, 10, 2020-05-05 00:00:01

After batchId: 2

(empty output)

Update-2

Adding input and expected records for each batch. Rows marked with red color are discarded and not shown in the output, because another row with the same country name and a more recent timestamp was already seen in a previous batch.


Answer 1:


In order to filter out late-arriving events in a streaming app, you need to keep state in your application that tracks the latest processed event per key; in your case the key is the country.

case class AppState(country:String, latestTs:java.sql.Timestamp)

Within a micro-batch you might receive multiple events for a key. When you do groupByKey(_.country) you get the events belonging to that key (country); compare them against the state to find the latest input event, update the state with the latest timestamp for the key, and proceed with that latest event for further processing. For late-arriving events, the function should return an empty Option[Event], which is then filtered out in subsequent processing.
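
A minimal sketch of that idea is below. It uses flatMapGroupsWithState, so a late event can be dropped by returning an empty iterator rather than a None that has to be filtered afterwards; this is a variation on what the answer describes, not the answer author's code. The Event case class, the updateState function, and the parsedEvents Dataset are illustrative names assumed here (kafkaDF would first have to be parsed into a Dataset[Event]); AppState is the case class defined above.

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._   // encoders for the case classes (assumes a SparkSession named spark)

case class Event(country: String, count: Int, ts: java.sql.Timestamp)

// Emit the newest event of the micro-batch only if it is newer than the
// timestamp remembered in the state; otherwise treat it as a late arrival.
def updateState(
    country: String,
    events: Iterator[Event],
    state: GroupState[AppState]): Iterator[Event] = {
  val newest = events.maxBy(_.ts.getTime)
  val previousTs = state.getOption.map(_.latestTs.getTime).getOrElse(Long.MinValue)
  if (newest.ts.getTime > previousTs) {
    state.update(AppState(country, newest.ts))
    Iterator.single(newest)   // new key or more recent timestamp: keep it
  } else {
    Iterator.empty            // late arrival: drop it
  }
}

// parsedEvents: Dataset[Event] derived from kafkaDF (assumed)
val deduped = parsedEvents
  .groupByKey(_.country)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(updateState)

Since the query already runs with outputMode("update"), each micro-batch then emits a row only when a new key or a more recent timestamp is seen, which matches the per-batch output described in the question.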

Refer to this blog for a detailed explanation.




Answer 2:


Try using the window function in Spark Structured Streaming; check the example below.

import org.apache.spark.sql.functions._   // for col and window

val columns = Seq("country", "id").map(col(_))
df.groupBy(window($"timestamp", "10 minutes", "5 minutes"), columns: _*)
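
As a hedged sketch of how that grouping could be carried to a "latest row per country within each window" result (the struct-based max is an assumption on my part, not part of the original answer, and it deduplicates per window rather than across the whole stream):

import org.apache.spark.sql.functions._

// For each 10-minute window (sliding every 5 minutes) and country, keep the row
// with the greatest timestamp by taking the max of a (timestamp, id) struct.
val latestPerWindow = df
  .groupBy(window($"timestamp", "10 minutes", "5 minutes"), $"country")
  .agg(max(struct($"timestamp", $"id")).as("latest"))
  .select($"country", $"latest.timestamp".as("timestamp"), $"latest.id".as("id"))

Note that on a streaming DataFrame this aggregation would normally also need a watermark on the timestamp column to keep the state bounded.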

You can also check the same in this question; the solution there is in Python.



Source: https://stackoverflow.com/questions/62738727/how-to-deduplicate-and-keep-latest-based-on-timestamp-field-in-spark-structured
