I have an EMR cluster with a single "c3.8xlarge" machine. After reading several resources, I understood that I have to allow a decent amount of off-heap memory because I am using p
As mentioned above, when calling toPandas(), all records of the DataFrame are collected to the driver program and hence should be done on a small subset of the data. (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html)
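To illustrate the point in the quoted documentation, here is a minimal sketch of capping the row count before collecting to the driver. The helper name `to_pandas_sample`, the default of 10,000 rows, and the `demo` function are my own assumptions for illustration, not anything from the Spark docs. The config key `spark.sql.execution.arrow.pyspark.enabled` is the Spark 3.x name; on Spark 2.x the key is `spark.sql.execution.arrow.enabled`.

```python
def to_pandas_sample(df, max_rows=10_000):
    """Collect at most `max_rows` rows of a Spark DataFrame as a pandas DataFrame.

    Hypothetical helper: the name and default cap are illustrative only.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is evaluated by Spark, so only the capped subset is sent
    # to the driver when toPandas() runs.
    return df.limit(max_rows).toPandas()


def demo():
    """Example usage; requires pyspark and a working Java runtime."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("topandas-sketch")  # arbitrary application name
        # Use Arrow for the Spark -> pandas conversion (Spark 3.x key).
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .getOrCreate()
    )
    try:
        df = spark.range(1_000_000)  # toy DataFrame with a single `id` column
        pdf = to_pandas_sample(df, max_rows=1_000)
        print(pdf.shape)
    finally:
        spark.stop()
```

Calling `demo()` starts a local session and collects only the capped subset. The cap limits what reaches the driver, but it does not change how much memory the executors or the Python workers need for the upstream computation.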