Spark Streaming mapWithState seems to rebuild complete state periodically

予麋鹿 2020-12-09 11:05

I am working on a Scala (2.11) / Spark (1.6.1) streaming project and am using mapWithState() to keep track of data seen in previous batches.

The state i
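
A minimal sketch of the kind of setup described, assuming a word-count-style keyed stream with a state timeout (the socket source, key/value types, timeout, and checkpoint path below are illustrative assumptions, not the asker's actual code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateSketch {
  // Keeps a running count per key; emits the final count when a key times out.
  def trackState(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
    if (state.isTimingOut()) {
      // Timeout path: the state can only be read here, not updated.
      (word, state.get())
    } else {
      val sum = one.getOrElse(0) + state.getOption().getOrElse(0)
      state.update(sum)
      (word, sum)
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mapWithState-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/checkpoint") // mapWithState requires a checkpoint directory

    val words = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    val stateStream = words.mapWithState(
      StateSpec.function(trackState _).timeout(Seconds(60)))

    stateStream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```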

2 Answers
  •  爱一瞬间的悲伤
    2020-12-09 11:40

    In addition to the accepted answer, which points out the serialization cost associated with checkpointing, there is another, less-known issue that can contribute to the spiky behaviour: eviction of deleted states.

    Specifically, 'deleted' or 'timed-out' states are not removed from the map immediately; they are marked for deletion and actually removed only during serialization [in Spark 1.6.1, see writeObjectInternal()].

    This has two performance implications, which occur only once every 10 batches:

    1. The traversal and deletion process itself has a cost.
    2. If you process the stream of timed-out/deleted events, e.g. persist it to external storage, the cost accumulated over all 10 batches is paid only at that point, and not, as one might expect, on each RDD (see the sketch below).
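
    To make the second point concrete, here is a hedged continuation of a setup like the one sketched under the question, reusing the illustrative `words` stream and `trackState` function from there. Judging from the 1.6 source, the checkpoint interval set on the mapWithState stream appears to be forwarded to the internal state DStream, so shortening it should spread the eviction work over smaller, more frequent spikes (whether that trade-off pays off depends on your checkpoint I/O):

    ```scala
    // Reusing `words` and `trackState` from the sketch in the question.
    val stateStream = words.mapWithState(
      StateSpec.function(trackState _).timeout(Seconds(60)))

    // The internal state map is checkpointed (and timed-out entries actually
    // evicted) every 10 batches by default. Checkpointing every 2 batches
    // (with a 10s batch interval) trades smaller, more frequent eviction
    // spikes for extra checkpoint I/O.
    stateStream.checkpoint(Seconds(20))

    // Records for timed-out keys arrive in bursts at checkpoint time, so
    // per-record external writes are concentrated there as well.
    stateStream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        partition.foreach { case (word, count) =>
          println(s"$word -> $count") // placeholder for a real external-store write
        }
      }
    }
    ```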
