Why does Spark save Map phase output to local disk?


醉话见心 2020-12-28 22:41

I'm trying to understand the Spark shuffle process in depth. When I started reading, I came across the following point.

Spark writes the Map task(ShuffleMapTa

2 answers

忘掉有多难 (original poster)

2020-12-28 23:26
First of all, Spark doesn't work in a strict map-reduce manner, and map output is not written to disk unless it is necessary. What gets written to disk are shuffle files.

That doesn't mean data after the shuffle is not kept in memory. Spark writes shuffle files mostly to avoid recomputation when multiple downstream actions need the same shuffled data. Why write to a file system at all? There are at least two intertwined reasons:
- Memory is a valuable resource, and in-memory caching in Spark is ephemeral. Old data can be evicted from the cache when needed.
- A shuffle is an expensive process that we want to avoid when it is not necessary. It makes more sense to store shuffle data in a way that keeps it persistent for the lifetime of a given context.
The shuffle itself, apart from the ongoing low-level optimization efforts and implementation details, isn't different at all from the one in Hadoop. It is based on the same basic approach, with all its limitations.
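To make the idea of shuffle files concrete, here is a toy model in plain Python. It is not Spark's implementation: the function names (`shuffle_write`, `shuffle_read`, `run_demo`), the file naming, and the `key % num_reducers` partitioning are all assumptions made for illustration. It only shows the general pattern: each map task buckets its output by reducer and writes the buckets to local files, and those files can then be read more than once without rerunning the map side.

```python
import os
import tempfile
from collections import defaultdict


def shuffle_write(records, num_reducers, out_dir, map_id):
    """Toy map side: bucket (key, value) pairs by reducer, one local file per bucket."""
    buckets = defaultdict(list)
    for key, value in records:
        # Hypothetical partitioner: integer key modulo the number of reducers.
        buckets[key % num_reducers].append((key, value))
    paths = {}
    for reducer_id in range(num_reducers):
        path = os.path.join(out_dir, f"shuffle_{map_id}_{reducer_id}.txt")
        with open(path, "w") as f:
            for key, value in buckets[reducer_id]:
                f.write(f"{key}\t{value}\n")
        paths[reducer_id] = path
    return paths


def shuffle_read(all_paths, reducer_id):
    """Toy reduce side: read this reducer's bucket from every map task and sum per key."""
    totals = defaultdict(int)
    for paths in all_paths:
        with open(paths[reducer_id]) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                totals[int(key)] += int(value)
    return dict(totals)


def run_demo():
    # Two "map tasks", each holding one partition of (key, value) pairs.
    map_inputs = [[(1, 10), (2, 20), (3, 30)], [(1, 1), (2, 2), (4, 4)]]
    with tempfile.TemporaryDirectory() as out_dir:
        all_paths = [
            shuffle_write(records, 2, out_dir, map_id)
            for map_id, records in enumerate(map_inputs)
        ]
        # Two downstream "actions" reuse the same files; the map side runs once.
        first = [shuffle_read(all_paths, r) for r in range(2)]
        second = [shuffle_read(all_paths, r) for r in range(2)]
    return first, second
```

Calling `run_demo()` returns the same per-reducer totals for both reads, which is the point of keeping the files around: the second read costs only local I/O, not a second pass over the map inputs.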

How are tasks different from Hadoop maps? As nicely illustrated by Justin Pihony, multiple transformations which don't require shuffles are squashed together into a single task. Since these operate on standard Scala Iterators, operations on individual elements can be pipelined.
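The pipelining described above can be sketched with Python generators, which behave much like lazy Scala Iterators for this purpose. This is an analogy, not Spark code: `traced_source` and `pipelined_task` are hypothetical names, and the three steps (double, filter, add one) are arbitrary stand-ins for narrow transformations.

```python
def traced_source(values, log):
    """Yield elements one at a time, recording when each one is read."""
    for v in values:
        log.append(f"read {v}")
        yield v


def pipelined_task(partition, log):
    """Chain map -> filter -> map lazily inside one 'task'.

    No intermediate collection is built between steps: each element flows
    through the whole chain before the next element is read.
    """
    doubled = (x * 2 for x in traced_source(partition, log))
    kept = (x for x in doubled if x % 3 != 0)
    shifted = (x + 1 for x in kept)
    out = []
    for x in shifted:
        log.append(f"emit {x}")
        out.append(x)
    return out
```

Running `pipelined_task([1, 2, 3, 4], log)` with an empty `log` list interleaves the reads and emits (`read 1`, `emit 3`, `read 2`, `emit 5`, ...) instead of finishing one step over the whole partition before starting the next.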

Regarding network and I/O bottlenecks, there is no silver bullet here. While Spark can reduce the amount of data that is written to disk or shuffled by combining transformations, caching in memory, and providing transformation-aware worker preferences, it is subject to the same limitations as any other distributed framework.