collect() or toPandas() on a large DataFrame in pyspark/EMR


醉梦人生 2020-11-30 12:41

I have an EMR cluster consisting of a single "c3.8xlarge" machine. After reading several resources, I understood that I have to allow a decent amount of off-heap memory because I am using p

3 answers

佛祖请我去吃肉 (original poster)

2020-11-30 13:34

As mentioned above, toPandas() collects all records of the DataFrame to the driver program, so it should only be called on a small subset of the data (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).
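One way to act on this is to cap the number of rows before converting. The snippet below is only a minimal sketch: the helper name to_pandas_bounded and the default max_rows are made up for illustration, and it assumes df is a pyspark.sql.DataFrame (only its limit() and toPandas() methods are used).

```python
def to_pandas_bounded(df, max_rows=100_000):
    """Collect at most `max_rows` rows of a Spark DataFrame to the driver.

    Hypothetical helper, not part of PySpark. `df` is assumed to be a
    pyspark.sql.DataFrame; only limit() and toPandas() are called on it.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is a lazy transformation, so the row cap is part of the query
    # plan and is applied before any data is sent to the driver.
    return df.limit(max_rows).toPandas()


# Example call (spark_df and the column names are placeholders):
# pdf = to_pandas_bounded(spark_df.select("col_a", "col_b"), max_rows=50_000)
```

Selecting only the needed columns, or filtering or aggregating first, shrinks the result further. A row cap alone does not guarantee that the result fits in driver memory if individual rows are very wide.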
