Question
Due to the complexity of the jars that I must include in my Spark code, I would like to ask for help figuring out how to solve this issue without removing the log4j import.
The simple code is as follows:
:cp symjar/log4j-1.2.17.jar
import org.apache.spark.rdd._
val hadoopConf=sc.hadoopConfiguration;
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId","AKEY")
hadoopConf.set("fs.s3n.awsSecretAccessKey","SKEY")
val numOfProcessors = 2
val filePath = "s3n://SOMEFILE.csv"
var rdd = sc.textFile(filePath, numOfProcessors)
def doStuff(rdd: RDD[String]): RDD[String] = {rdd}
doStuff(rdd)
First, I am getting this error:
error: error while loading StorageLevel, class file '/root/spark/lib/spark-assembly-1.3.0-hadoop1.0.4.jar(org/apache/spark/storage/StorageLevel.class)' has location not matching its contents: contains class StorageLevel
error: error while loading Partitioner, class file '/root/spark/lib/spark-assembly-1.3.0-hadoop1.0.4.jar(org/apache/spark/Partitioner.class)' has location not matching its contents: contains class Partitioner
error: error while loading BoundedDouble, class file '/root/spark/lib/spark-assembly-1.3.0-hadoop1.0.4.jar(org/apache/spark/partial/BoundedDouble.class)' has location not matching its contents: contains class BoundedDouble
error: error while loading CompressionCodec, class file '/root/spark/lib/spark-assembly-1.3.0-hadoop1.0.4.jar(org/apache/hadoop/io/compress/CompressionCodec.class)' has location not matching its contents: contains class CompressionCodec
Then, I run this line again, and the error disappears:
var rdd = sc.textFile(filePath, numOfProcessors)
However, the end-result of the code is:
error: type mismatch;
found : org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[String]
required: org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.org.apache.spark.rdd.RDD[String]
doStuff(rdd)
^
How can I keep the log4j import and still avoid the errors above? (This is important, since the jars I have depend heavily on log4j and conflict with the spark-shell.)
Answer 1:
The answer is not to rely on the :cp command alone, but also to add everything you need in .../spark/conf/spark-env.sh under export SPARK_SUBMIT_CLASSPATH=".../the/path/to/a.jar".
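For illustration, here is a minimal sketch of what that line in spark-env.sh could look like; the absolute path below is hypothetical (it simply reuses the symjar/log4j-1.2.17.jar from the question and assumes it sits under /root), so adjust it to wherever your jars actually live:

# in .../spark/conf/spark-env.sh -- put the conflicting jars on the submit classpath
export SPARK_SUBMIT_CLASSPATH="/root/symjar/log4j-1.2.17.jar"

Multiple jars can be listed in the same value, separated by colons, and the spark-shell needs to be restarted for the change to take effect.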
Answer 2:
Another option, if you are using an IDE such as Scala IDE for Eclipse together with Maven, is to exclude the conflicting jars in Maven. For example, I wanted to exclude commons-codec (and then include a different version as a JAR in the project; a sketch of the Maven alternative follows the pom fragment below), so I made the change in pom.xml as:
...............
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>1.3.0</version>
    <exclusions>
      <exclusion>
        <groupId>commons-codec</groupId>
        <artifactId>commons-codec</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
</dependencies>
...............
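If you would rather have Maven bring in the replacement commons-codec instead of dropping it into the project as a loose JAR, a minimal sketch of the extra dependency could look like the following (the 1.10 version is only an illustrative choice, not taken from the original answer):

<dependency>
  <groupId>commons-codec</groupId>
  <artifactId>commons-codec</artifactId>
  <version>1.10</version>
</dependency>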
Source: https://stackoverflow.com/questions/29375027/must-include-log4j-but-it-is-causing-errors-in-apache-spark-shell-how-to-avoid