Out of memory error when collecting data out of Spark cluster

Submitted by 徘徊边缘 on 2019-12-04 00:30:44

When you call collect on the DataFrame, two things happen (see the sketch after this list):

  1. First, all of the data has to be written to the output on the driver.
  2. The driver has to collect the data from all nodes and keep it in its memory.
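
A minimal PySpark sketch of where the error comes from; the spark.range DataFrame is a hypothetical stand-in for your data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-demo").getOrCreate()

# Hypothetical stand-in for your actual data.
df = spark.range(100_000_000)

# Every partition is serialized, shipped to the driver, and held
# there as one local list of Row objects; if that list does not
# fit in the driver's heap, you get the OutOfMemoryError.
rows = df.collect()
```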

Answer:

If you are looking to just load the data into the memory of the executors, count() is also an action; it will load the data into the executors' memory, where it can be used by other processes.
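
A sketch of that pattern, where cache() marks the data for persistence and count() is the action that actually materializes it in executor memory (the names here are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-on-executors").getOrCreate()
df = spark.range(100_000_000)  # hypothetical stand-in for your data

df.cache()  # lazy: only marks the DataFrame for persistence
df.count()  # action: computes the data and fills the executor caches;
            # only a single number comes back to the driver
```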

If you want to extract the data to the driver, then try the following (along with other properties) when pulling the data: "--conf spark.driver.maxResultSize=10g".
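
The same property can also be set when building the session instead of on the spark-submit command line; a minimal sketch, where 10g is illustrative and should be sized to the result you actually expect to collect:

```python
from pyspark.sql import SparkSession

# Equivalent to passing --conf spark.driver.maxResultSize=10g to spark-submit.
spark = (
    SparkSession.builder
    .appName("big-collect")
    .config("spark.driver.maxResultSize", "10g")
    .getOrCreate()
)
```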

As mentioned above, "cache" is not an action; check RDD Persistence:

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. 

But "collect" is an action, and all computations (including "cache") will be started when "collect" is called.

You run the application in standalone mode, which means that the initial data loading and all computations are performed in the same memory.

It is the data loading and the other computations that use most of the memory, not "collect".

You can check this by replacing "collect" with "count".
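
A sketch of that check, using the same hypothetical data as above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-check").getOrCreate()
df = spark.range(100_000_000)  # hypothetical stand-in for your data

# Same loading and computation as collect(), but only a single number
# travels back to the driver. If this still fails, the memory is going
# to the processing itself; if it succeeds, collect() was the problem.
n = df.count()
```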
