How do I import classes from one or more local .jar files into a Spark/Scala Notebook?

Posted by 隐身守侯 on 2021-01-01 08:13:35

Question


I am struggling to load classes from JARs into my Scala-Spark kernel Jupyter notebook. I have jars at this location:

/home/hadoop/src/main/scala/com/linkedin/relevance/isolationforest/

with contents listed as follows:

-rwx------ 1 hadoop hadoop   7170 Sep 11 20:54 BaggedPoint.scala
-rw-rw-r-- 1 hadoop hadoop 186719 Sep 11 21:36 isolation-forest_2.3.0_2.11-1.0.1.jar
-rw-rw-r-- 1 hadoop hadoop   1482 Sep 11 21:36 isolation-forest_2.3.0_2.11-1.0.1-javadoc.jar
-rw-rw-r-- 1 hadoop hadoop  20252 Sep 11 21:36 isolation-forest_2.3.0_2.11-1.0.1-sources.jar
-rwx------ 1 hadoop hadoop  16133 Sep 11 20:54 IsolationForestModelReadWrite.scala
-rwx------ 1 hadoop hadoop   5740 Sep 11 20:54 IsolationForestModel.scala
-rwx------ 1 hadoop hadoop   4057 Sep 11 20:54 IsolationForestParams.scala
-rwx------ 1 hadoop hadoop  11301 Sep 11 20:54 IsolationForest.scala
-rwx------ 1 hadoop hadoop   7990 Sep 11 20:54 IsolationTree.scala
drwxrwxr-x 2 hadoop hadoop    157 Sep 11 21:35 libs
-rwx------ 1 hadoop hadoop   1731 Sep 11 20:54 Nodes.scala
-rwx------ 1 hadoop hadoop    854 Sep 11 20:54 Utils.scala

When I attempt to load the IsolationForest class like so:

import com.linkedin.relevance.isolationforest.IsolationForest

I get the following error in my notebook:

<console>:33: error: object linkedin is not a member of package com
       import com.linkedin.relevance.isolationforest.IsolationForest

I've been Googling for several hours now to get to this point but am unable to progress further. What is the next step?

By the way, I am attempting to use this package: https://github.com/linkedin/isolation-forest

Thank you.


Answer 1:


For Scala:

If you're using spylon-kernel, you can specify additional jars in the %%init_spark section, as described in the docs (the first setting below is for a jar file, the second for a package):

%%init_spark
launcher.jars = ["/some/local/path/to/a/file.jar"]
launcher.packages = ["com.acme:super:1.0.1"]
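
For the jar in the question, that cell would look like this (a sketch using the path listed above; run it before any other cell so the launcher picks it up when the Spark session starts):

%%init_spark
launcher.jars = ["/home/hadoop/src/main/scala/com/linkedin/relevance/isolationforest/isolation-forest_2.3.0_2.11-1.0.1.jar"]

After the Spark session initializes, import com.linkedin.relevance.isolationforest.IsolationForest should resolve.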

For Python:

In the first cell of the Jupyter notebook, before initializing the SparkSession, do the following:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars <full_path_to>/isolation-forest_2.3.0_2.11-1.0.1.jar pyspark-shell'

This will add the jars to the PySpark context. Note that PYSPARK_SUBMIT_ARGS is read when the JVM is launched, so it must be set before the first SparkSession is created. It's better to use --packages instead of --jars, because it also fetches all necessary dependencies and puts everything into the local cache. For example:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.linkedin.isolation-forest:isolation-forest_2.3.0_2.11:1.0.0 pyspark-shell'

You only need to select the version that matches your PySpark and Scala versions (the 2.3.x and 2.4 builds use Scala 2.11, the 3.0 build uses Scala 2.12), as listed in the Git repo.
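
Once the jar (or package) is on the classpath by either route, the import from the question should resolve. In the Scala kernel, usage then follows the standard Spark ML fit/transform pattern from the project's README (a sketch: the parameter values and column names are illustrative, and data is assumed to be an existing DataFrame with an assembled Vector column named "features"):

import com.linkedin.relevance.isolationforest.IsolationForest

// Configure the estimator; the setters follow the project's README,
// and the values here are only illustrative.
val isolationForest = new IsolationForest()
  .setNumEstimators(100)
  .setFeaturesCol("features")
  .setPredictionCol("predictedLabel")
  .setScoreCol("outlierScore")

// Fit on the assumed DataFrame and score it.
val model = isolationForest.fit(data)
val scored = model.transform(data)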




Answer 2:


I got the following to work with pure Scala, Jupyter Lab, and Almond (which uses Ammonite), with no Spark or any other heavy layer involved:

interp.load.cp(os.pwd / "yourfile.jar")

The above, added as a statement directly in the notebook, loads yourfile.jar from the current directory. After this you can import from the jar, for instance import yourfile._ if yourfile is the name of the top-level package. One caveat I observed: you should wait until the kernel has started properly before attempting to load. If the first statement is run too quickly (for instance with "Restart and Run All"), the whole thing hangs; this seems to be an unrelated issue.

You can, of course, construct another path (see the os-lib documentation for the available API). The Ammonite documentation on magic imports also explains how to load a package from Ivy or how to load a Scala script. The trick is to use the interp object and the LoadJar trait you can access from it. LoadJar has the following API:

trait LoadJar {

  /**
   * Load a `.jar` file or directory into your JVM classpath
   */
  def cp(jar: os.Path): Unit
  /**
   * Load a `.jar` from a URL into your JVM classpath
   */
  def cp(jar: java.net.URL): Unit
  /**
   * Load one or more `.jar` files or directories into your JVM classpath
   */
  def cp(jars: Seq[os.Path]): Unit
  /**
   * Load a library from its maven/ivy coordinates
   */
  def ivy(coordinates: Dependency*): Unit
}
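
Applied to the jar from the question, loading by absolute path (or from Ivy, as mentioned above) would look like this (a sketch; the Ivy coordinates are copied from Answer 1 and should be checked against the repo's listing):

// First cell: add the jar to the classpath (os.Path requires an absolute path).
interp.load.cp(os.Path("/home/hadoop/src/main/scala/com/linkedin/relevance/isolationforest/isolation-forest_2.3.0_2.11-1.0.1.jar"))

// Alternative: resolve it together with its dependencies via Ammonite's
// magic Ivy import (coordinates as in Answer 1):
// import $ivy.`com.linkedin.isolation-forest:isolation-forest_2.3.0_2.11:1.0.0`

// Next cell, once the classpath update has taken effect:
import com.linkedin.relevance.isolationforest.IsolationForest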


Source: https://stackoverflow.com/questions/63854636/how-do-i-import-classes-from-one-or-more-local-jar-files-into-a-spark-scala-not
