Does Spark write intermediate shuffle outputs to disk?

Submitted by ぃ、小莉子 on 2021-01-26 16:46:28

Question


I'm reading Learning Spark, and I don't understand what it means that Spark's shuffle outputs are written to disk. See Chapter 8, Tuning and Debugging Spark, pages 148-149:

Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persisted. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.

As I understand it, there are different persistence policies; for example, the default MEMORY_ONLY means the intermediate result is never persisted to disk.

When and why will a shuffle persist something on disk? How can that be reused by further computations?


Answer 1:


When

It happens when an operation that requires a shuffle is evaluated for the first time (by an action), and it cannot be disabled.

Why

This is an optimization. Shuffling is one of the most expensive operations in Spark, so keeping the shuffle files around spares later jobs from recomputing them.

How can it be reused by further computations?

It is reused automatically by any subsequent action executed on the same RDD.
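A minimal spark-shell sketch can make this concrete (the data and key function here are illustrative, and `sc` is the session's SparkContext):

```scala
// Assumes a spark-shell session, where sc is the SparkContext.
val pairs = sc.parallelize(1 to 1000000).map(x => (x % 100, x))

// reduceByKey requires a shuffle; the first action materializes the
// shuffle output on the executors' local disks.
val sums = pairs.reduceByKey(_ + _)
sums.count()

// A second action on the same RDD reuses those shuffle files: in the
// Spark UI the map-side stage now shows up as "skipped", and only the
// post-shuffle stage actually runs.
sums.collect()
```

Note that `sums` was never explicitly persisted; the skipped stage comes entirely from the shuffle files left on disk by the first action.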



Source: https://stackoverflow.com/questions/40949835/does-spark-write-intermediate-shuffle-outputs-to-disk
