Does Spark keep all elements of an RDD[K,V] for a particular key in a single partition after “groupByKey”, even if the data for a key is very large?

Submitted by 自闭症网瘾萝莉.ら on 2019-12-04 08:08:11

Does Spark keep all elements (...) for a particular key in a single partition after groupByKey?

Yes, it does. That is the whole point of the shuffle.
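A quick way to see this locally (a minimal sketch, assuming Spark in local mode; the RDD, key space, and partition count are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    object GroupByKeyLocality {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("locality"))

        // 100,000 records spread across 10 keys, shuffled into 8 partitions.
        val pairs = sc.parallelize(1 to 100000).map(i => (i % 10, i))
        val grouped = pairs.groupByKey(8)

        // Record which partition index each key ends up in.
        val keyToPartition = grouped
          .mapPartitionsWithIndex { case (idx, iter) => iter.map { case (k, _) => (k, idx) } }
          .collect()

        // After the shuffle, every key appears in exactly one partition.
        val partitionsPerKey = keyToPartition.groupBy(_._1).mapValues(_.map(_._2).distinct.length)
        assert(partitionsPerKey.values.forall(_ == 1))

        sc.stop()
      }
    }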

the partition for key a can be of a size that may not fit in a worker's RAM. In that case, what will Spark do?

The size of a particular partition is not the biggest issue here. Partitions are represented as lazy Iterators and can easily hold data that exceeds the amount of available memory. The main problem is the non-lazy local data structure generated in the process of grouping.

All values for a particular key are stored in memory as a CompactBuffer, so a single large group can result in an OOM error. Even if each record individually fits in memory, you may still encounter serious GC issues.
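If what you actually need is a per-key aggregate rather than the group itself, the usual workaround is to aggregate without ever materializing the groups. A hedged sketch, reusing the hypothetical pairs RDD from above:

    // groupByKey first materializes every value for a key in a CompactBuffer
    // and only then computes the size -- each whole group must fit in memory.
    val sizesRisky = pairs.groupByKey().mapValues(_.size)

    // reduceByKey combines values map-side and streams through the data,
    // keeping only one running Long per key in memory.
    val sizesSafe = pairs.mapValues(_ => 1L).reduceByKey(_ + _)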

In general:

  • It is safe, although not optimal performance-wise, to repartition data where the amount of data assigned to a partition exceeds the amount of available memory (illustrated in the sketch after this list).
  • It is not safe to use PairRDDFunctions.groupByKey in the same situation.
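
To make the contrast concrete, here is a sketch (same hypothetical pairs RDD): after a repartition, each partition is consumed through a lazy Iterator, so records stream through one at a time even when the partition as a whole exceeds RAM.

    val repartitioned = pairs.repartition(4)

    // mapPartitions receives a lazy Iterator; nothing below forces the whole
    // partition into memory, so an oversized partition still processes safely.
    val processed = repartitioned.mapPartitions { iter =>
      iter.map { case (k, v) => (k, v * 2) }
    }

    // By contrast, pairs.groupByKey() buffers all values for each key
    // before emitting the group, which is where a single huge key can OOM.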

Note: You shouldn't extrapolate this to other implementations of groupByKey, though. In particular, both Spark Dataset and PySpark RDD.groupByKey use more sophisticated mechanisms.
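For example (a sketch, assuming an existing SparkSession named spark and the same hypothetical pairs RDD), the Dataset API hands each group to your function as an Iterator, which lets Spark's sort-based machinery spill to disk rather than buffer the whole group the way the RDD-side CompactBuffer does:

    import spark.implicits._

    val ds = spark.createDataset(pairs)  // Dataset[(Int, Int)]

    // Each group arrives as an Iterator; consuming it lazily avoids the
    // RDD-style CompactBuffer materialization described above.
    val groupSizes = ds.groupByKey(_._1).mapGroups { (key, values) => (key, values.size) }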
