Does the state also get removed on event-time timeout with Spark Structured Streaming?

試著忘記壹切 submitted on 2021-02-05 09:26:35

Question



Q. When a timeout fires, does the state get timed out and removed at the same time,
or
does it only get timed out while the state itself remains? I am asking about both ProcessingTimeTimeout and EventTimeTimeout.

I was experimenting with mapGroupsWithState/flatMapGroupsWithState and got confused about how the state timeout works.

Suppose I maintain state with a watermark of 10 seconds and apply a timeout based on event time, like this:

ds.withWatermark("timestamp", "10 seconds")
  .groupByKey(...)
  .mapGroupsWithState(
    GroupStateTimeout.EventTimeTimeout // timeout based on event time
  )(my_mapping_function)

In my mapping function I perform some operations depending on whether the state exists.
I check it like this:

// Body of my_mapping_function for mapGroupsWithState/flatMapGroupsWithState

if (state.hasTimedOut) {
  println("State has timedout")
  state.remove()
} else {
  val newState = state.getOption match {
    case Some(s) =>
      .... // some operations
    case _ =>
      println("no state")
      .. // return some state
  }

  state.update(newState)

  // Set the timeout. Does the state also get removed automatically once it has timed out?
  state.setTimeoutTimestamp(state.getCurrentWatermarkMs(), "10 seconds")
}
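For reference, here is a minimal sketch of a complete mapping function in this style. The types Event, Counter and Output and the counting logic are hypothetical and not part of the original code. As far as the GroupState documentation describes it, a timed-out group is called once with an empty iterator and hasTimedOut set to true, and the previous state is still readable during that call. The sketch assumes the state stays in the store unless remove() is called explicitly.

```scala
import org.apache.spark.sql.streaming.GroupState

// Hypothetical types used only for illustration.
case class Event(key: String, timestampMs: Long)
case class Counter(count: Long)
case class Output(key: String, count: Long, expired: Boolean)

// Has the (key, values, state) => output shape that mapGroupsWithState expects.
def myMappingFunction(
    key: String,
    events: Iterator[Event],
    state: GroupState[Counter]): Output = {
  if (state.hasTimedOut) {
    // Timeout call: no new events for this key, old state can still be read.
    val last = state.getOption.getOrElse(Counter(0L))
    // Explicit removal; assumption: without this call the state is kept.
    state.remove()
    Output(key, last.count, expired = true)
  } else {
    val previous = state.getOption.getOrElse(Counter(0L))
    val updated = Counter(previous.count + events.size)
    state.update(updated)
    // The timeout is re-armed on every call that should keep one active.
    state.setTimeoutTimestamp(state.getCurrentWatermarkMs(), "10 seconds")
    Output(key, updated.count, expired = false)
  }
}
```

It would be passed as `.mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(myMappingFunction _)` on a `KeyValueGroupedDataset[String, Event]`, with `spark.implicits._` in scope for the encoders.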

Now consider an example with the watermark delay set to 10 seconds:
incoming data (data1) with ts 12 seconds
incoming data (data1) with ts 20 seconds
so the watermark at this point is (20 - 10) = 10 seconds

incoming data (data2) with ts 12 seconds
data2's state will time out at 20 seconds
(10 seconds (the current watermark) + 10 seconds (the additional timeout we set))

Then suppose more data arrives: incoming data (data1) with ts 20 seconds
similarly, incoming data (data1) with ts 30 seconds
similarly, incoming data (data1) with ts 40 seconds
At this point the watermark is 30 seconds (40 - 10).

So data2's state has timed out, because its last event had ts 12 seconds.
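The arithmetic in this example can be checked with a small sketch in plain Scala. It is a simplified model, not Spark code: it assumes the watermark is just the maximum event time seen so far minus the delay, and it ignores the fact that Spark only advances the watermark between micro-batches. The object and value names are made up for illustration.

```scala
object WatermarkTimeline {
  // Watermark delay from withWatermark("timestamp", "10 seconds").
  val delaySec = 10L

  // Simplified model: max event time seen so far minus the delay.
  def watermarkAfter(eventTimesSec: Seq[Long]): Long =
    math.max(0L, eventTimesSec.max - delaySec)

  // data1 events at 12s and 20s arrive first.
  val wmWhenData2Arrives: Long = watermarkAfter(Seq(12L, 20L))

  // data2's call sets its timeout to the current watermark plus 10s.
  val data2TimeoutSec: Long = wmWhenData2Arrives + 10L

  // More data1 events at 30s and 40s push the watermark forward.
  val wmLater: Long = watermarkAfter(Seq(12L, 20L, 30L, 40L))

  // The timeout is due once the watermark has moved past the timeout timestamp.
  val data2TimedOut: Boolean = wmLater > data2TimeoutSec

  def main(args: Array[String]): Unit =
    println(s"watermark=$wmLater timeout=$data2TimeoutSec timedOut=$data2TimedOut")
}
```

Under this model the watermark after the event at 40 seconds is 40 - 10 = 30 seconds, which is past data2's timeout of 20 seconds, so data2 is due for a timeout call.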

Q. When data2's state timed out, did the state only get timed out, or did it also get removed?


I ask because it did not print println("State has timedout");
it printed println("no state") instead.

Source: https://stackoverflow.com/questions/65917336/does-the-state-also-gets-removed-on-event-timeout-with-spark-structured-streamin
