When are accumulators truly reliable?

Asked by 醉梦人生 on 2020-12-02 09:12

I want to use an accumulator to gather some stats about the data I'm manipulating in a Spark job. Ideally, I would do that while the job computes the required transformations.
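
For illustration, here is a minimal sketch of the pattern I have in mind (Spark 2.x Scala API; the input path, the parsing logic, and all names are hypothetical):

    import org.apache.spark.sql.SparkSession

    object AccumulatorStats {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("accumulator-stats")
          .master("local[*]") // local mode just for the example
          .getOrCreate()
        val sc = spark.sparkContext

        // Built-in long accumulator; named accumulators show up in the Spark UI.
        val malformed = sc.longAccumulator("malformed-records")

        val parsed = sc.textFile("data.txt") // hypothetical input
          .map { line =>
            val fields = line.split(",")
            if (fields.length < 2) malformed.add(1) // stat gathered inside a transformation
            fields
          }

        parsed.count() // accumulators are only populated once an action runs
        println(s"malformed records: ${malformed.value}")

        spark.stop()
      }
    }

The question is whether a value gathered this way, inside a transformation, can be trusted once tasks get retried or stages recomputed.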

3 Answers
  •  青春惊慌失措 · 2020-12-02 09:25

    I think Matei answered this in the referenced discussion:

    As discussed on https://github.com/apache/spark/pull/2524 this is pretty hard to provide good semantics for in the general case (accumulator updates inside non-result stages), for the following reasons:

    • An RDD may be computed as part of multiple stages. For example, if you update an accumulator inside a MappedRDD and then shuffle it, that might be one stage. But if you then call map() again on the MappedRDD, and shuffle the result of that, you get a second stage where that map is pipelined. Do you want to count this accumulator update twice or not?

    • Entire stages may be resubmitted if shuffle files are deleted by the periodic cleaner or are lost due to a node failure, so anything that tracks RDDs would need to do so for long periods of time (as long as the RDD is referenceable in the user program), which would be pretty complicated to implement.

    So I'm going to mark this as "won't fix" for now, except for the part for result stages done in SPARK-3628.
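
    To make the first point concrete, here is a small sketch (Scala, local mode; all names are made up): an accumulator updated inside a transformation is applied again whenever the RDD is recomputed, whereas an update made inside an action, i.e. a result stage, is the exactly-once case that SPARK-3628 covers.

        import org.apache.spark.sql.SparkSession

        object DoubleCountDemo {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("accumulator-double-count")
              .master("local[*]")
              .getOrCreate()
            val sc = spark.sparkContext

            // Update inside a transformation: no exactly-once guarantee.
            val seen = sc.longAccumulator("rows-seen")
            val mapped = sc.parallelize(1 to 100).map { x =>
              seen.add(1)
              x * 2
            }

            mapped.count()      // map runs: seen.value is 100
            mapped.count()      // mapped is not cached, so the map runs again
            println(seen.value) // 200: the update was applied twice

            // Update inside an action (a result stage): applied exactly once
            // per task, even across task retries (the SPARK-3628 guarantee).
            val reliable = sc.longAccumulator("rows-seen-reliably")
            sc.parallelize(1 to 100).foreach(_ => reliable.add(1))
            println(reliable.value) // 100, and it stays 100 under retries

            spark.stop()
          }
        }

    Caching mapped would avoid the recomputation in this particular run, but a lost executor or evicted blocks can still trigger it, so the only dependable pattern is to update accumulators inside actions.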
