I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16GB of memory, so that should not be a problem since my file is only 300MB. Alth
You can set spark.driver.maxResultSize to 2GB when you start the pyspark shell:
pyspark --conf "spark.driver.maxResultSize=2g"
This allows the driver to accept up to 2g of collected results for spark.driver.maxResultSize.
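If you want to confirm that the setting took effect, a quick check from inside the shell (assuming a Spark version where SparkContext.getConf() is available):
sc.getConf().get("spark.driver.maxResultSize")  # should return '2g'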
You can set the spark.driver.maxResultSize parameter in the SparkConf object:
from pyspark import SparkConf, SparkContext
# In Jupyter you have to stop the current context first
sc.stop()
# Create new config
conf = (SparkConf()
.set("spark.driver.maxResultSize", "2g"))
# Create new context
sc = SparkContext(conf=conf)
You should probably create a new SQLContext as well:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
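For reference, on Spark 2.x and later the same setting can be passed when building a SparkSession instead (a minimal sketch, not specific to the 1.4 shell above):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate())
sc = spark.sparkContext  # the underlying context picks up the setting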
Tuning spark.driver.maxResultSize is a good practice for the running environment. However, it is not the solution to your problem, as the amount of data may change over time. As @Zia-Kayani mentioned, it is better to collect data wisely. So if you have a DataFrame df, you can call df.rdd and do all the magic stuff on the cluster, not in the driver. However, if you need to collect the data, I would suggest:
- Tuning spark.sql.parquet.binaryAsString, since String objects take more space
- Enabling spark.rdd.compress to compress RDDs when you collect them
- Collecting the data page by page, for example:
long count = df.count();
int limit = 50;
while (count > 0) {
    DataFrame df1 = df.limit(limit);
    df1.show();            // will print 50, the next 50, etc. rows
    df = df.except(df1);
    count = count - limit;
}
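A rough PySpark equivalent of the same pagination idea (a sketch; assumes df is a DataFrame, and uses subtract() since except is a reserved word in Python):
count = df.count()
limit = 50
while count > 0:
    page = df.limit(limit)   # take the next batch of rows
    page.show()              # print them on the driver
    df = df.subtract(page)   # drop the rows already shown
    count -= limit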
It looks like you are collecting the RDD, so it will definitely pull all of the data to the driver node, which is why you are facing this issue.
You should avoid collecting an RDD if it is not required; if it is necessary, then increase spark.driver.maxResultSize. There are two ways of setting this variable:
1 - create the Spark config with this variable set, e.g.
conf.set("spark.driver.maxResultSize", "3g")
2 - or set this variable in the spark-defaults.conf file present in Spark's conf folder, e.g.
spark.driver.maxResultSize 3g
and restart Spark.
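If you only need a sample of the data, or you want it written out rather than returned to the driver, these alternatives to collect() avoid the result-size limit altogether (a sketch; the output path is just a placeholder):
first_rows = rdd.take(100)                 # brings only 100 elements back to the driver
rdd.saveAsTextFile("hdfs:///tmp/output")   # writes from the executors, nothing is collected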
While starting the job or the shell, you can use
--conf spark.driver.maxResultSize=0
to remove the limit entirely (0 means unlimited).
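For example, with spark-submit (the script name here is just a placeholder); keep in mind that with the limit disabled, a very large collect can still exhaust driver memory:
spark-submit --conf spark.driver.maxResultSize=0 your_job.py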
From the command line, such as with pyspark, --conf spark.driver.maxResultSize=3g
can also be used to increase the max result size.