> I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16 GB of memory, so that should not be a problem since my file is only 300 MB.
Tuning `spark.driver.maxResultSize` is good practice for the environment you are running in. However, it is not the solution to your problem, because the amount of data may change over time. As @Zia-Kayani mentioned, it is better to collect data wisely. So if you have a DataFrame `df`, you can call `df.rdd` and do all the heavy work on the cluster, not in the driver (a short sketch of this idea is at the end of this answer). However, if you do need to collect the data, I would suggest:
- Avoid turning on `spark.sql.parquet.binaryAsString`, since String objects take more space.
- Turn on `spark.rdd.compress` to compress RDDs when you collect them.
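A minimal sketch of how those settings could be applied; the app name, the `1g` value for `spark.driver.maxResultSize`, and the `SQLContext` setup are only example assumptions, not something taken from the question:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

SparkConf conf = new SparkConf()
        .setAppName("collect-tuning")              // hypothetical app name
        .set("spark.rdd.compress", "true")         // compress serialized RDD data
        .set("spark.driver.maxResultSize", "1g");  // example cap; tune for your job

JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

// Keep Parquet binary columns as binary instead of decoding them to Strings.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "false");
```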
If you still have to bring the rows to the driver, page through the DataFrame with `limit` instead of collecting everything at once:

```java
long count = df.count();
int limit = 50;
while (count > 0) {
    DataFrame df1 = df.limit(limit);
    df1.show();            // prints the first 50 rows, then the next 50, and so on
    df = df.except(df1);   // drop the rows that were just shown
    count = count - limit;
}
```
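And a rough sketch of keeping the work on the cluster rather than collecting the full DataFrame, as mentioned above; the column names `value` and `category` are made up purely for illustration:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// Filter and aggregate on the cluster first, so only a small summary
// ever has to travel back to the driver.
DataFrame summary = df
        .filter(df.col("value").gt(100))   // hypothetical predicate
        .groupBy("category")               // hypothetical grouping column
        .count();

Row[] rows = summary.collect();            // tiny result, safe to collect

// Or drop down to the RDD API (df.rdd in Scala, javaRDD() in Java)
// and keep the processing distributed.
JavaRDD<Row> rdd = df.javaRDD();
long matches = rdd.filter(r -> r.getInt(0) > 100).count();
```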