Understanding huge shuffle spill sizes in Spark


Question


With Spark 2.3 I'm running the following code:

rdd
.persist(DISK_ONLY) // this is 3GB according to storage tab
.groupBy(_.key)     // shuffles all rows for a given key to a single task
.mapValues(iter => iter.map(x => CaseClass(x._1, x._2))) // wrap each value in a two-field case class
.mapValues(iter => func(iter)) // func consumes the whole iterator for one key
  • I have a SQL DataFrame of 300M rows
  • I convert it to an RDD, then persist it; the storage tab indicates it's 3GB
  • I do a groupBy. One of my keys is receiving 100M items, so roughly 1GB if I go by the RDD size
  • After the shuffle, I map each item to a case class. This case class only has 2 Double fields
  • I'm sending the full iterator containing all of a partition's data to a function that will process this stream (a self-contained sketch of this pipeline follows this list)
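
For reference, here is a minimal, self-contained sketch of the pipeline described above, runnable in spark-shell. Record, CaseClass, its field names, func, and the tiny parallelized range are placeholders I made up to stand in for the real (unshown) definitions and for the 300M-row DataFrame:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

case class Record(key: Int, a: Double, b: Double)  // stand-in for the real input rows
case class CaseClass(x: Double, y: Double)         // the two-Double case class from the question

// placeholder for func: a single-pass aggregate over one key's values
def func(values: Iterable[CaseClass]): Double =
  values.foldLeft(0.0)((acc, c) => acc + c.x + c.y)

val spark = SparkSession.builder().appName("spill-sketch").getOrCreate()

val rdd = spark.sparkContext
  .parallelize(0L until 1000000L)                  // small stand-in for the 300M-row DataFrame
  .map(i => Record((i % 10).toInt, i.toDouble, i.toDouble))

val perKey = rdd
  .persist(StorageLevel.DISK_ONLY)
  .groupBy(_.key)
  .mapValues(values => values.map(r => CaseClass(r.a, r.b)))
  .mapValues(values => func(values))

perKey.collect().foreach(println)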

What I observe is that the task processing the 100M case class instances always fails after 1h+ of processing. In the "Aggregated Metrics by Executor" section of the UI I see huge values in the "Shuffle Spill" columns, around 10 GB, which is 3 times the size of the full RDD. When I take a thread dump of the slow executor, it seems stuck in disk read/write operations.

Can somebody tell me what's going on? I understand that 100M case class instances are probably too big to fit into a single executor's RAM, but I don't understand the following:

1) Isn't Spark supposed to "stream" all the instances into my func function? Why is it trying to store everything on the receiving executor node?
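
To make the distinction in (1) concrete, here are two hypothetical stand-ins for func: one that only ever keeps a running total, and one that explicitly forces the whole group into memory. Whether the Iterable that Spark hands to func after a groupBy is itself already fully materialized is exactly what this question is asking:

case class CaseClass(x: Double, y: Double)  // the two-Double case class from the question

// single-pass consumer: looks at each element once and keeps only a running total
def streamingFunc(values: Iterable[CaseClass]): Double =
  values.iterator.map(c => c.x + c.y).sum

// materializing consumer: toArray forces every instance of the group into memory at once
def materializingFunc(values: Iterable[CaseClass]): Double =
  values.toArray.map(c => c.x + c.y).sum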

2) Where does the memory blow-up come from? I don't understand why serializing 100M case class instances should take around 10 GB, which is roughly 100 bytes per item (assuming the data spilled to disk is the CaseClass instances; I'm not sure at which point in my job the data is spilled).
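
For what it's worth, the ~100 bytes-per-item figure (10 GB / 100M records) can be sanity-checked locally. This hypothetical snippet, runnable in a plain Scala REPL, measures the average size that Java serialization (which Spark falls back to for custom classes during RDD shuffles unless spark.serializer is set to Kryo) produces for a two-Double case class:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

case class CaseClass(x: Double, y: Double)  // the two-Double case class from the question

val n = 100000
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
(0 until n).foreach(i => oos.writeObject(CaseClass(i.toDouble, i.toDouble)))
oos.close()

// the class descriptor is written once at the start of the stream, so for large n
// this approaches the steady-state per-record cost
println(f"~${bos.size().toDouble / n}%.1f bytes per serialized CaseClass")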

Source: https://stackoverflow.com/questions/53622577/understanding-huge-shuffle-spill-sizes-in-spark
