collect() or toPandas() on a large DataFrame in pyspark/EMR

醉梦人生 2020-11-30 12:41

I have an EMR cluster with a single c3.8xlarge machine. After reading several resources, I understood that I have to allow a decent amount of off-heap memory because I am using pyspark.
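For context, the off-heap room that the question refers to is usually reserved through `spark.executor.memoryOverhead` (and, for `collect()`, the driver-side `spark.driver.maxResultSize` matters too). A minimal sketch of such a submission, with illustrative sizes and a hypothetical script name `my_job.py`:

```shell
# Illustrative sizes only -- tune to your node; my_job.py is a placeholder.
spark-submit \
  --conf spark.executor.memory=40g \
  --conf spark.executor.memoryOverhead=10g \
  --conf spark.driver.memory=8g \
  --conf spark.driver.maxResultSize=8g \
  my_job.py
```

The overhead pool is where the Python worker processes live, so starving it is a common cause of YARN killing containers during `toPandas()`.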

3 Answers
  •  时光取名叫无心
    2020-11-30 13:29

    Enabling the Apache Arrow setting will give you a speedup on `toPandas()`:

    # Spark 3.x name of the Arrow flag
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Legacy Spark 2.x name (deprecated in 3.x, kept for compatibility)
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    
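    A minimal end-to-end sketch of the above, assuming a local Spark 3.x session (the DataFrame and its size are illustrative). With Arrow enabled, the JVM-to-Python transfer in `toPandas()` is columnar and skips per-row pickling, which is where the speedup comes from:

    ```python
    # Sketch, not the asker's job: demonstrates Arrow-backed toPandas().
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("arrow-demo")
             .getOrCreate())

    # Spark 3.x name of the Arrow flag
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    # Illustrative DataFrame; replace with your real data
    df = spark.range(1_000_000).withColumnRenamed("id", "n")

    # Columnar Arrow transfer instead of row-by-row serialization
    pdf = df.toPandas()
    print(len(pdf))  # 1000000
    ```

    Note that Arrow only changes how data crosses the JVM/Python boundary; the full result must still fit in driver memory, so it does not remove the need for the memory settings discussed in the question.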
