Spark Streaming mapWithState seems to rebuild complete state periodically

前端未结

关注

 2  1575

I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of seen data from previous batches.

The state i

相关标签:

2条回答

爱一瞬间的悲伤

2020-12-09 11:40
In addition to the accepted answer, pointing out the price of serialization related to checkpointing, there's another, less known issue which might contribute to the spikey behaviour: eviction of deleted states.

Specifically, 'deleted' or 'timed out' states are not removed immediately from the map, but are marked for deletion and actually removed only in the process of serialization [in Spark 1.6.1, see writeObjectInternal()].

This has two performance implications, which occur only once per 10 batches:
1. The traversal and deletion process has its price
2. If you process the stream of timed-out/ deleted events, e.g. persist it to external storage, the associated cost for all 10 batches will be paid only at this point (and not as one might have expected, on each RDD)
0 讨论(0)
发布评论:

提交评论
- 加载中...
闹比i

2020-12-09 11:53
Is this a bug in the mapWithState() functionality or is this intended behaviour?

This is intended behavior. The spikes you're seeing is because your data is getting checkpointed at the end of that given batch. If you'll notice the time on the longer batches, you'll see that it happens persistently every 100 seconds. That's because the checkpoint time is constant, and is calculated per your batchDuration, which is how often you talk to your data source to read a batch multiplied by some constant, unless you explicitly set the DStream.checkpoint interval.

Here is the relevant piece of code from MapWithStateDStream:
```
override def initialize(time: Time): Unit = {
  if (checkpointDuration == null) {
    checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
  }
  super.initialize(time)
}
```
Where DEFAULT_CHECKPOINT_DURATION_MULTIPLIER is:
```
private[streaming] object InternalMapWithStateDStream {
  private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
}
```
Which lines up exactly with the behavior you're seeing, since your read batch duration is every 10 seconds => 10 * 10 = 100 seconds.

This is normal, and that is the cost of persisting state with Spark. An optimization on your side could be to think how you can minimize the size of the state you have to keep in memory, in order for this serialization to be as quick as possible. Additionaly, make sure that the data is spread out throughout enough executors, so that state is distributed uniformly between all nodes. Also, I hope you've turned on Kryo Serialization instead of the default Java serialization, that can give you a meaningful performance boost.
0 讨论(0)
发布评论:

提交评论
- 加载中...