How to use external (custom) package in pyspark?

[亡魂溺海] 提交于 2019-12-18 07:03:17

问题


I am trying to replicate the soultion given here https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_python.html to import external packages in pypspark. But it is failing.

My code:

spark_distro.py

from pyspark import SparkContext, SparkConf

def import_my_special_package(x):
    from external_package import external
    return external.fun(x)

conf = SparkConf()
sc = SparkContext()
int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_my_special_package(x)).collect()

external_package.py

class external:

    def __init__(self,in):
        self.in = in

    def fun(self,in):
        return self.in*3

spark submit command:

spark-submit \
   --master yarn \
  /path to script/spark_distro.py  \
  --py-files /path to script/external_package.py \
  1000

Actual Error:

Actual:
  vs = list(itertools.islice(iterator, batch))
  File "/home/gsurapur/pyspark_examples/spark_distro.py", line 13, in <lambda>
  File "/home/gsurapur/pyspark_examples/spark_distro.py", line 6, in import_my_special_package
ImportError: No module named external_package

Expected output:

[3,6,9,12]

I tried sc.addPyFile option too and it is failing with same issue. Please help me finding the issue. Thanks in advance.


回答1:


I know that, in hindsight, it sounds silly, but the order of the arguments of spark-submit is not in general interchangeable: all Spark-related arguments, including --py-file, must be before the script to be executed:

# your case:
spark-submit --master yarn-client /home/ctsats/scripts/SO/spark_distro.py --py-files /home/ctsats/scripts/SO/external_package.py
[...]
ImportError: No module named external_package

# correct usage:
spark-submit --master yarn-client --py-files /home/ctsats/scripts/SO/external_package.py /home/ctsats/scripts/SO/spark_distro.py
[...]
[3, 6, 9, 12]

Tested with your scripts modified as follows:

spark_distro.py

from pyspark import SparkContext, SparkConf

def import_my_special_package(x):
    from external_package import external
    return external(x)

conf = SparkConf()
sc = SparkContext()
int_rdd = sc.parallelize([1, 2, 3, 4])
print int_rdd.map(lambda x: import_my_special_package(x)).collect()

external_package.py

def external(x):
     return x*3

with the modifications arguably not changing the essence of the question...




回答2:


Here is the situation regarding addPyFile:

spark_distro2.py

from pyspark import SparkContext, SparkConf

def import_my_special_package(x):
    from external_package import external
    return external(x)

conf = SparkConf()
sc = SparkContext()
sc.addPyFile("/home/ctsats/scripts/SO/external_package.py") # added
int_rdd = sc.parallelize([1, 2, 3, 4])
print int_rdd.map(lambda x: import_my_special_package(x)).collect()

Test:

spark-submit --master yarn-client /home/ctsats/scripts/SO/spark_distro2.py
[...]
[3, 6, 9, 12]


来源:https://stackoverflow.com/questions/47101375/how-to-use-external-custom-package-in-pyspark

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!