spark 2.1.0 session config settings (pyspark)

情深已故 2020-12-12 16:27

I am trying to overwrite the spark session/spark context default configs, but it is picking entire node/cluster resource.

    spark = SparkSession.builder
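
Roughly, the pattern being described appears to be creating the session first and then calling spark.conf.set on it afterwards (a sketch only, not the asker's original snippet; the specific values are illustrative):

    spark = SparkSession.builder.getOrCreate()

    # These calls run after the session already exists, so resource-related
    # settings such as executor memory are not applied to the running context.
    spark.conf.set("spark.executor.memory", "8g")
    spark.conf.set("spark.executor.cores", "3")
    spark.conf.set("spark.driver.memory", "8g")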
5 Answers
  • 2020-12-12 16:47

    You aren't actually overwriting anything with this code. To see that for yourself, try the following.

    As soon as you start the pyspark shell, type:

    sc.getConf().getAll()
    

    This will show you all of the current config settings. Then try your code and do it again. Nothing changes.

    What you should do instead is create a new configuration and use that to create a SparkContext. Do it like this:

    # Build a fresh configuration with the desired resource settings
    conf = pyspark.SparkConf().setAll([('spark.executor.memory', '8g'), ('spark.executor.cores', '3'), ('spark.cores.max', '3'), ('spark.driver.memory', '8g')])
    sc.stop()  # stop the existing SparkContext
    sc = pyspark.SparkContext(conf=conf)  # start a new one with the new conf
    

    Then you can check yourself just like above with:

    sc.getConf().getAll()
    

    This should reflect the configuration you wanted.
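
    If you only want to spot-check a single key, SparkConf.get works too (a small addition on top of the answer, using the value set above):

    sc.getConf().get('spark.executor.memory')  # should now return '8g'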

  • 2020-12-12 16:48

    Updating the configuration in Spark 2.3.1

    To change the default Spark configuration, you can follow these steps:

    Import the required classes

    from pyspark.conf import SparkConf
    from pyspark.sql import SparkSession
    

    Get the default configurations

    spark.sparkContext._conf.getAll()
    

    Update the default configurations

    conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '4g'), ('spark.app.name', 'Spark Updated Conf'), ('spark.executor.cores', '4'), ('spark.cores.max', '4'), ('spark.driver.memory','4g')])
    

    Stop the current Spark Session

    spark.sparkContext.stop()
    

    Create a Spark Session

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
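
    To confirm the update took effect, you can read a value back from the new session (a quick check, not part of the original steps):

    spark.sparkContext.getConf().get('spark.executor.memory')  # expected: '4g'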
    
  • 2020-12-12 16:48

    Setting 'spark.driver.host' to 'localhost' in the config works for me:

    spark = SparkSession \
        .builder \
        .appName("MyApp") \
        .config("spark.driver.host", "localhost") \
        .getOrCreate()
    
  • 2020-12-12 16:55

    My requirement was a bit different: I needed to check whether executor and driver memory sizes were passed in as parameters and, if so, replace only the executor and driver settings in the config. Below are the steps:

    1. Import Libraries
    from pyspark.conf import SparkConf
    from pyspark.sql import SparkSession
    
    2. Define the Spark session and get the default configuration
    spark = (SparkSession.builder
            .master("yarn")
            .appName("experiment") 
            .config("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
            .getOrCreate())
    
    conf = spark.sparkContext._conf.getAll()
    
    3. Check whether the executor and driver memory parameters were provided (this is pseudo-code with a single conditional check; you can extend it to cover more cases). If they were, apply them; otherwise keep the default configuration.
    if executor_mem is not None and driver_mem is not None:
        conf = spark.sparkContext._conf.setAll([('spark.executor.memory',executor_mem),('spark.driver.memory',driver_mem)])
        spark.sparkContext.stop()
        spark = SparkSession.builder.config(conf=conf).getOrCreate()
    else:
        pass  # no memory parameters were given; keep the existing session and its default configuration
    

    Don't forget to stop the existing Spark context; that is what ensures the executor and driver memory sizes actually change to the values you passed in. Hope this helps!
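
    As a rough sketch of where executor_mem and driver_mem might come from (argparse is my own assumption here, not something the answer specifies):

    import argparse

    # Parse optional memory sizes from the command line, e.g. --executor-mem 8g
    parser = argparse.ArgumentParser()
    parser.add_argument('--executor-mem', dest='executor_mem', default=None)
    parser.add_argument('--driver-mem', dest='driver_mem', default=None)
    args = parser.parse_args()

    executor_mem = args.executor_mem  # e.g. '8g', or None if not supplied
    driver_mem = args.driver_mem      # e.g. '4g', or None if not supplied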

  • 2020-12-12 17:01

    You can also set configuration properties when you start pyspark, just as you would with spark-submit:

    pyspark --conf property=value
    

    Here is one example:

    -bash-4.2$ pyspark
    Python 3.6.8 (default, Apr 25 2019, 21:02:35) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
          /_/
    
    Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
    SparkSession available as 'spark'.
    >>> spark.conf.get('spark.eventLog.enabled')
    'true'
    >>> exit()
    
    
    -bash-4.2$ pyspark --conf spark.eventLog.enabled=false
    Python 3.6.8 (default, Apr 25 2019, 21:02:35) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
          /_/
    
    Using Python version 3.6.8 (default, Apr 25 2019 21:02:35)
    SparkSession available as 'spark'.
    >>> spark.conf.get('spark.eventLog.enabled')
    'false'
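
    The same --conf flag works with spark-submit as well (a sketch; the script name and values are placeholders, not from the original answer):

    spark-submit --conf spark.executor.memory=8g --conf spark.driver.memory=8g my_app.py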
