Spark Shell Add Multiple Drivers/Jars to Classpath using spark-defaults.conf

Submitted by 前提是你 on 2021-02-04 06:51:17

Question


We are using the Spark-Shell REPL mode to test various use-cases and to connect to multiple sources/sinks.

We need to add custom drivers/jars in the spark-defaults.conf file. I have tried to add multiple jars separated by commas, like:

spark.driver.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar 
spark.executor.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar

But it's not working. Can anyone please provide details of the correct syntax?


Answer 1:


As an example, in addition to Prateek's answer, I have had some success by adding the following to the spark-defaults.conf file, which is loaded when starting a spark-shell session in client mode.

spark.jars         jars_added/aws-java-sdk-1.7.4.jar,jars_added/hadoop-aws-2.7.3.jar,jars_added/sqljdbc42.jar,jars_added/jtds-1.3.1.jar

Adding this exact line to the spark-defaults.conf file loads the listed jar files, as long as they are stored in the jars_added folder relative to the directory from which spark-shell is run (doing this also seems to remove the need to place the jar files on the slaves at the specified locations). I created the 'jars_added' folder in my $SPARK_HOME directory, so whenever I run spark-shell I must run it from that directory. I have not yet worked out how to change the base location against which the spark.jars setting is resolved; it seems to default to the current directory when launching spark-shell. As hinted at by Prateek, the jar files need to be comma separated.
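If you would rather not depend on the launch directory, spark.jars should also accept absolute paths. A minimal sketch, assuming the jars actually live under /home/sandeep/jars_added (a hypothetical location):

# hypothetical absolute paths; adjust to wherever the jars really live
spark.jars    /home/sandeep/jars_added/aws-java-sdk-1.7.4.jar,/home/sandeep/jars_added/hadoop-aws-2.7.3.jar,/home/sandeep/jars_added/sqljdbc42.jar,/home/sandeep/jars_added/jtds-1.3.1.jar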

I also had to set SPARK_CONF_DIR to $SPARK_HOME/conf (export SPARK_CONF_DIR="${SPARK_HOME}/conf") for spark-shell to recognise the location of my config file (i.e. spark-defaults.conf). I'm using PuTTY to ssh onto the master.

Just to clarify: once I have added spark.jars jar1,jar2,jar3 to my spark-defaults.conf file, I type the following to start my spark-shell session:

cd $SPARK_HOME   # navigate to the spark home directory, which contains the jars_added folder
spark-shell

On start-up, spark-shell then loads the specified jar files from the jars_added folder.
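To sanity-check that the jars were actually picked up (this check is my addition, not part of the original answer), the Scala spark-shell can list the jars registered with the SparkContext and echo the effective spark.jars value:

scala> sc.listJars().foreach(println)   // jars distributed with the application
scala> spark.conf.get("spark.jars")     // the value resolved from spark-defaults.conf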




Answer 2:


Note: verified on Linux Mint with Spark 3.0.1.

If you set properties in spark-defaults.conf, Spark will pick up those settings only when you submit your job using spark-submit.

Note: I have not yet verified this with spark-shell and pyspark.
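As an aside (my addition, not part of the original answer), the same jars and packages can also be passed directly on the command line, since spark-shell and pyspark accept the usual spark-submit options:

# local jar(s) added to the driver and executor classpaths
spark-shell --jars /home/sandeep/mysql-connector-java-5.1.36.jar

# Maven coordinates resolved at start-up
pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1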

file: spark-defaults.conf

spark.driver.extraJavaOptions      -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages                 org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1

In the terminal, run your job, say wordcount.py:

spark-submit /path-to-file/wordcount.py

If you want to run your job in development mode from an IDE, then you should use the config() method. Here we will set the Kafka and Avro packages. Also, if you want to include log4j.properties, use extraJavaOptions.

The app name and master can be provided in two ways:

  1. use .appName() and .master()
  2. use a .conf file (see util.py and spark.conf below)

file: hellospark.py

from logger import Log4j
from util import get_spark_app_config
from pyspark.sql import SparkSession

# first approach: set everything in code with .config()
spark = SparkSession.builder \
    .appName('Hello Spark') \
    .master('local[3]') \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.jars.packages", 
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,
                  org.apache.spark:spark-avro_2.12:3.0.1") \
    .config("spark.driver.extraJavaOptions", 
                 "-Dlog4j.configuration=file:log4j.properties "
                 "-Dspark.yarn.app.container.log.dir=app-logs "
                 "-Dlogfile.name=hello-spark") \
    .getOrCreate()

# second approach: load base settings from spark.conf via get_spark_app_config()
conf = get_spark_app_config()
spark = SparkSession.builder \
    .config(conf=conf) \
    .config("spark.jars.packages", 
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
    .getOrCreate()

logger = Log4j(spark)

file: logger.py

from pyspark.sql import SparkSession


class Log4j(object):
    def __init__(self, spark: SparkSession):
        conf = spark.sparkContext.getConf()
        app_name = conf.get("spark.app.name")
        log4j = spark._jvm.org.apache.log4j
        self.logger = log4j.LogManager.getLogger(app_name)

    def warn(self, message):
        self.logger.warn(message)

    def info(self, message):
        self.logger.info(message)

    def error(self, message):
        self.logger.error(message)

    def debug(self, message):
        self.logger.debug(message)
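For completeness (this snippet is my illustration, not part of the original answer), the wrapper is used from hellospark.py like this; the messages are placeholders:

# hypothetical log calls using the Log4j wrapper defined above
logger.info("Hello Spark application started")
logger.warn("this goes to the log4j logger named after spark.app.name")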

file: util.py

import configparser
from pyspark import SparkConf

def get_spark_app_config(enable_delta_lake=False):
    """
    It will read configuration from spark.conf file to create
    an instance of SparkConf(). Can be used to create
    SparkSession.builder.config(conf=conf).getOrCreate()
    :return: instance of SparkConf()
    """
    spark_conf = SparkConf()
    config = configparser.ConfigParser()
    config.read("spark.conf")

    for (key, value) in config.items("SPARK_APP_CONFIGS"):
        spark_conf.set(key, value)

    if enable_delta_lake:
        for (key, value) in config.items("DELTA_LAKE_CONFIGS"):
            spark_conf.set(key, value)
    return spark_conf

file: spark.conf

[SPARK_APP_CONFIGS]
spark.app.name = Hello Spark
spark.master = local[3]
spark.sql.shuffle.partitions = 3

[DELTA_LAKE_CONFIGS]
spark.jars.packages = io.delta:delta-core_2.12:0.7.0
spark.sql.extensions = io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog = org.apache.spark.sql.delta.catalog.DeltaCatalog
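Putting util.py and spark.conf together, here is a minimal sketch of a driver script that also enables the Delta Lake settings (the file name run_delta.py and the output path are hypothetical, and it assumes the io.delta package above can be resolved when the session starts):

file: run_delta.py

# hypothetical example, not part of the original answer
from pyspark.sql import SparkSession
from util import get_spark_app_config

# SparkConf with both SPARK_APP_CONFIGS and DELTA_LAKE_CONFIGS applied
conf = get_spark_app_config(enable_delta_lake=True)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# quick smoke test: write and read back a tiny Delta table
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-demo")
spark.read.format("delta").load("/tmp/delta-demo").show()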


Source: https://stackoverflow.com/questions/57862801/spark-shell-add-multiple-drivers-jars-to-classpath-using-spark-defaults-conf
