collect() or toPandas() on a large DataFrame in pyspark/EMR

醉梦人生 2020-11-30 12:41

I have an EMR cluster with a single c3.8xlarge machine. After reading several resources, I understood that I have to allow a decent amount of off-heap memory because I am using pyspark.
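For context, the off-heap room that the question refers to is usually reserved through `spark.executor.memoryOverhead` (and, for `collect()`, the driver-side `spark.driver.maxResultSize` matters too). A minimal sketch of such a submission, with illustrative sizes and a hypothetical script name `my_job.py`:

```shell
# Illustrative sizes only -- tune to your node; my_job.py is a placeholder.
spark-submit \
  --conf spark.executor.memory=40g \
  --conf spark.executor.memoryOverhead=10g \
  --conf spark.driver.memory=8g \
  --conf spark.driver.maxResultSize=8g \
  my_job.py
```

The overhead pool is where the Python worker processes live, so starving it is a common cause of YARN killing containers during `toPandas()`.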

3 Answers
  •  时光取名叫无心
    2020-11-30 13:29

    Enabling the Apache Arrow setting will give you a speedup on `toPandas()`:

    # Spark 3.x name of the Arrow flag
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Legacy Spark 2.x name (deprecated in 3.x, kept for compatibility)
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    
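    A minimal end-to-end sketch of the above, assuming a local Spark 3.x session (the DataFrame and its size are illustrative). With Arrow enabled, the JVM-to-Python transfer in `toPandas()` is columnar and skips per-row pickling, which is where the speedup comes from:

    ```python
    # Sketch, not the asker's job: demonstrates Arrow-backed toPandas().
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("arrow-demo")
             .getOrCreate())

    # Spark 3.x name of the Arrow flag
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    # Illustrative DataFrame; replace with your real data
    df = spark.range(1_000_000).withColumnRenamed("id", "n")

    # Columnar Arrow transfer instead of row-by-row serialization
    pdf = df.toPandas()
    print(len(pdf))  # 1000000
    ```

    Note that Arrow only changes how data crosses the JVM/Python boundary; the full result must still fit in driver memory, so it does not remove the need for the memory settings discussed in the question.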
