Running custom Java class in PySpark

野的像风 2020-12-05 16:58

I'm trying to run a custom HDFS reader class in PySpark. This class is written in Java and I need to access it from PySpark, either from the shell or with spark-submit.

3 Answers
  • 2020-12-05 17:02

    The problem you've described usually indicates that org.foo.module is not on the driver CLASSPATH. One possible solution is to add your jar file with spark.driver.extraClassPath, which can be set in conf/spark-defaults.conf or passed as a command-line parameter.
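
As a rough sketch of the two places that setting can live, the helper below formats the same property as a conf/spark-defaults.conf line and as spark-submit arguments. The function name, the jar path, and the script name are placeholders chosen for illustration and are not part of Spark.

```python
def driver_classpath_settings(jar_path):
    """Format spark.driver.extraClassPath for the two places it can be set.

    jar_path is a placeholder; substitute the real location of your jar.
    """
    key = "spark.driver.extraClassPath"
    # spark-defaults.conf uses "key<whitespace>value" lines
    defaults_line = f"{key} {jar_path}"
    # spark-submit accepts arbitrary properties as --conf key=value
    submit_args = ["--conf", f"{key}={jar_path}"]
    return defaults_line, submit_args


defaults_line, submit_args = driver_classpath_settings("/path/to/your.jar")
print(defaults_line)
print(" ".join(["spark-submit", *submit_args, "your_script.py"]))
```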

    On a side note:

    • If the class you use is a custom input format, there should be no need to use the Py4J gateway at all. You can simply use the SparkContext.hadoop* / SparkContext.newAPIHadoop* methods.

    • Using java_import(jvm, "org.foo.module.*") looks like a bad idea. Generally speaking, you should avoid unnecessary imports on the JVM. It is not public for a reason, and you really don't want to mess with it, especially when you access the class in a way that makes the import completely redundant. So drop java_import and stick with jvm.org.foo.module.Foo().
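
The two notes above might look like this in practice. This is a sketch under assumptions, not a tested recipe: org.foo.module.FooInputFormat is a hypothetical class name, the key and value classes assume a text-like format, and sc is an existing SparkContext. Both functions only work if the jar is already on the classpath.

```python
def read_with_input_format(sc, path):
    """Read `path` through a custom Hadoop InputFormat, with no Py4J calls."""
    return sc.newAPIHadoopFile(
        path,
        inputFormatClass="org.foo.module.FooInputFormat",  # hypothetical class
        keyClass="org.apache.hadoop.io.LongWritable",      # assumed key type
        valueClass="org.apache.hadoop.io.Text",            # assumed value type
    )


def make_foo(sc):
    """Instantiate org.foo.module.Foo by its fully qualified name."""
    # No java_import: the package path is resolved attribute by attribute.
    # sc._jvm is PySpark's private handle to the JVM view.
    return sc._jvm.org.foo.module.Foo()
```

The first function returns an RDD of key/value pairs; the second returns a Py4J proxy on which you can call the Java object's methods, such as fooMethod().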

  • 2020-12-05 17:15

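    Rather than --jars, you can use --packages to pull dependencies into your spark-submit job. Note that --packages expects Maven coordinates (groupId:artifactId:version), so the jar has to be available in a Maven repository, local or remote.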
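
For reference, a minimal sketch of assembling a --packages argument follows. The coordinates org.foo:module:1.0 are invented for illustration, and the helper is not part of Spark.

```python
def packages_args(*coordinates):
    """Build the --packages argument from (group, artifact, version) tuples."""
    # Each coordinate is groupId:artifactId:version; several are comma-separated
    joined = ",".join(":".join(parts) for parts in coordinates)
    return ["--packages", joined]


args = packages_args(("org.foo", "module", "1.0"))  # hypothetical coordinates
print(" ".join(["spark-submit", *args, "name_of_your_python_file.py"]))
```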

  • 2020-12-05 17:25

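    In PySpark, try the following:

    from py4j.java_gateway import java_import

    # Make org.foo.module.Foo available by its short name on the JVM view
    java_import(sc._gateway.jvm, "org.foo.module.Foo")

    # Instantiate the Java class and call a method on it through Py4J
    foo = sc._gateway.jvm.Foo()
    foo.fooMethod()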

    Make sure that you have compiled your Java code into a runnable jar, and submit the Spark job like so:

    spark-submit --driver-class-path "name_of_your_jar_file.jar" --jars "name_of_your_jar_file.jar" name_of_your_python_file.py
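
Before submitting, it can be worth confirming that the jar really contains the class you expect, since a class missing from the jar produces the same symptom as a classpath problem. A jar is a zip archive, so a small check is possible with the standard library. The helper name is made up for this sketch.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar has an entry for the fully qualified class name."""
    # org.foo.module.Foo is stored as org/foo/module/Foo.class inside the jar
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, jar_contains_class("name_of_your_jar_file.jar", "org.foo.module.Foo") should return True for a correctly built jar.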
    