Ignore Spark Cluster Own Jars

Submitted by 强颜欢笑 on 2021-02-10 09:22:14

Question


I would like to use my own application Spark jars. More concretely, I have an mllib jar that is not yet released and that contains a fix for a BisectingKMeans bug. My idea is to use it on my Spark cluster (locally it works perfectly).

I've tried many things: extraClassPath, userClassPathFirst, the jars option... many options that do not work. My last idea was to use an sbt shade rule to rename all org.apache.spark.* packages to shadespark.* (see the sketch below), but when I deploy, the job still uses the cluster's Spark jars.
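For reference, the shade rule described above would look roughly like this in an sbt-assembly build.sbt (a sketch, assuming the sbt-assembly plugin is already enabled in project/plugins.sbt; the rename pattern mirrors the shadespark.* idea from the question):

```scala
// build.sbt -- sketch of the sbt-assembly shade rule described above
// (assumes addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "...") in project/plugins.sbt)
assembly / assemblyShadeRules := Seq(
  // Rewrite every org.apache.spark class bundled in the fat jar to a shadespark.* name
  ShadeRule.rename("org.apache.spark.**" -> "shadespark.@1").inAll
)
```

With .inAll, the rule rewrites both the bundled class files and every reference to them inside the assembled jar.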

Any ideas?


Answer 1:


You can try the Maven shade plugin to relocate the conflicting packages. This gives the newer version of the mllib jar a separate namespace, so both the old and the new version end up on the classpath, but since the new version has an alternative name you can refer to it explicitly.

Have a look at https://maven.apache.org/plugins/maven-shade-plugin/examples/class-relocation.html:

If the uber JAR is reused as a dependency of some other project, directly including classes from the artifact's dependencies in the uber JAR can cause class loading conflicts due to duplicate classes on the class path. To address this issue, one can relocate the classes which get included in the shaded artifact in order to create a private copy of their bytecode:
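Since the question's build uses sbt rather than Maven, the same relocation can be expressed with an sbt-assembly shade rule. The sketch below relocates only the patched mllib clustering package instead of all of Spark; the package pattern, target prefix, and module coordinates are illustrative assumptions, not a verified fix:

```scala
// build.sbt -- sbt-assembly counterpart of the Maven relocation above (sketch)
assembly / assemblyShadeRules := Seq(
  // Give the patched BisectingKMeans classes a private name so the cluster's
  // own spark-mllib jars cannot shadow them at runtime.
  ShadeRule.rename("org.apache.spark.mllib.clustering.**" -> "patched.spark.mllib.clustering.@1")
    .inProject
    .inLibrary("org.apache.spark" %% "spark-mllib" % "2.1.0")  // illustrative version
)
```

With .inProject, references inside your own compiled classes are rewritten as well, so the assembled jar consistently uses the relocated names. Note that the patched spark-mllib must actually be bundled into the assembly (not marked "provided") for its classes to be relocated.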

I got this idea from the video "Top 5 Mistakes When Writing Spark Applications": https://youtu.be/WyfHUNnMutg?t=23m1s



Source: https://stackoverflow.com/questions/42365013/ignore-spark-cluster-own-jars
