I am trying to overwrite the Spark session/Spark context default configs, but it is picking up the entire node/cluster resources.
from pyspark.sql import SparkSession

# minimal reconstruction of the snippet: the session is created first,
# then configs are set on it afterwards
spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.executor.memory", "8g")
You aren't actually overwriting anything with this code. To see it for yourself, try the following.
As soon as you start the pyspark shell, type:
sc.getConf().getAll()
This will show you all of the current config settings. Then try your code and do it again. Nothing changes.
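For example, spot-check a single key before and after the attempted override (a sketch based on the question's snippet; on Spark 2.x the set call is accepted but has no effect on the running context):
sc.getConf().get('spark.executor.memory')      # note the current value
spark.conf.set('spark.executor.memory', '8g')  # attempted override
sc.getConf().get('spark.executor.memory')      # same value as before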
What you should do instead is create a new configuration and use that to create a SparkContext. Do it like this:
import pyspark

conf = pyspark.SparkConf().setAll([
    ('spark.executor.memory', '8g'),
    ('spark.executor.cores', '3'),
    ('spark.cores.max', '3'),
    ('spark.driver.memory', '8g'),
])
sc.stop()
sc = pyspark.SparkContext(conf=conf)
Then you can check yourself just like above with:
sc.getConf().getAll()
This should reflect the configuration you wanted.
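You can also spot-check a single key instead of scanning the whole list:
sc.getConf().get('spark.executor.memory')   # '8g'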
Update configuration in Spark 2.3.1
To change the default spark configurations you can follow these steps:
Import the required classes
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
Get the default configurations
spark.sparkContext._conf.getAll()
Update the default configurations
conf = spark.sparkContext._conf.setAll([
    ('spark.executor.memory', '4g'),
    ('spark.app.name', 'Spark Updated Conf'),
    ('spark.executor.cores', '4'),
    ('spark.cores.max', '4'),
    ('spark.driver.memory', '4g'),
])
Stop the current Spark Session
spark.sparkContext.stop()
Create a Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
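Put together, the whole sequence looks like this (a sketch; _conf is an internal attribute, as used in the steps above, so it may change between versions):
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext._conf.getAll())    # defaults before the change

conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'),
                                        ('spark.driver.memory', '4g')])
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()

print(spark.sparkContext.getConf().get('spark.executor.memory'))   # '4g'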
Setting 'spark.driver.host' to 'localhost' in the config works for me:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("MyApp") \
    .config("spark.driver.host", "localhost") \
    .getOrCreate()
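You can verify that it took effect with the runtime-config getter:
spark.conf.get("spark.driver.host")   # 'localhost'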
I had a different requirement: I had to check whether executor and driver memory sizes were passed in as parameters and, if so, replace only the executor and driver settings in the config. Below are the steps:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# executor_mem and driver_mem hold the memory sizes passed in as job
# parameters (e.g. '4g'), or None when they were not supplied
spark = (SparkSession.builder
         .master("yarn")
         .appName("experiment")
         .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
         .getOrCreate())

conf = spark.sparkContext._conf.getAll()
if executor_mem is not None and driver_mem is not None:
    # rebuild the conf with only the executor and driver memory overridden
    conf = spark.sparkContext._conf.setAll([('spark.executor.memory', executor_mem),
                                            ('spark.driver.memory', driver_mem)])
    spark.sparkContext.stop()
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
# otherwise keep the existing session as-is
Don't forget to stop the existing Spark context; that's what ensures the executor and driver memory sizes change to the values you passed in as parameters. Hope this helps!
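As a quick sanity check after the rebuild (a sketch; executor_mem is the parameter from the snippet above):
spark.sparkContext.getConf().get('spark.executor.memory')   # should equal executor_mem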
You could also set the configuration when you start pyspark, just like with spark-submit:
pyspark --conf property=value
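The same flag works with spark-submit (the script name here is just a placeholder):
spark-submit --conf spark.executor.memory=4g --conf spark.driver.memory=4g my_app.py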
Here is one example:
-bash-4.2$ pyspark
Python 3.6.8 (default, Apr 25 2019, 21:02:35)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/
Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.eventLog.enabled')
'true'
>>> exit()
-bash-4.2$ pyspark --conf spark.eventLog.enabled=false
Python 3.6.8 (default, Apr 25 2019, 21:02:35)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/
Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
SparkSession available as 'spark'.
>>> spark.conf.get('spark.eventLog.enabled')
'false'
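Note that --conf can be repeated to set several properties at once:
pyspark --conf spark.eventLog.enabled=false --conf spark.executor.memory=4g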