collect() or toPandas() on a large DataFrame in pyspark/EMR


TL;DR I believe you're seriously underestimating memory requirements.

Even assuming the data is fully cached, the storage info shows only a fraction of the peak memory required to bring the data back to the driver: collecting involves serialized copies on the executors, in transit, and on the driver all at once.

Since the data is actually quite large, I would consider writing it to Parquet and reading it back directly in Python using PyArrow (see Reading and Writing the Apache Parquet Format), skipping the intermediate collect/toPandas stages entirely.
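A minimal sketch of that approach, assuming a local path and a stand-in DataFrame; on EMR you would typically write to S3 or HDFS and read it back on a machine with enough memory:

```python
from pyspark.sql import SparkSession
import pyarrow.parquet as pq

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # placeholder for the real DataFrame

# Write the result out as Parquet; nothing is collected to the driver.
df.write.mode("overwrite").parquet("/tmp/out_parquet")

# Read the Parquet directory back with PyArrow, bypassing Spark entirely.
table = pq.read_table("/tmp/out_parquet")
pdf = table.to_pandas()
```

This way the only large in-memory object is the final pandas DataFrame, rather than that plus the serialized rows Spark would otherwise buffer on the driver.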
