pyspark addPyFile to add zip of .py files, but module still not found

长发绾君心 2020-12-08 21:19

Using addPyFile() does not seem to be adding the desired files to the Spark job nodes (I'm new to Spark, so I may be missing some basic usage knowledge here).

Attempting

2 Answers
  • 2020-12-08 21:53

    If your module looks like this:

    myModule
    - __init__.py
    - spark1.py
    - spark2.py

    don't go inside the myModule folder and add its files to the zip; that is what produces the error you mentioned.

    Instead, go to the parent of the myModule folder, right-click the myModule folder, and add the whole folder to a zip (the archive itself can be given any name).

    The idea is that when Spark extracts your zip, a myModule folder must exist with the same name and hierarchy.
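
    A minimal sketch (not part of the original answer) of building such a zip programmatically with Python's standard library, assuming myModule lives under a placeholder path /path/to/parent:

    # Zip the package from its parent directory so the archive contains a
    # top-level "myModule/" folder -- the layout addPyFile() expects.
    import shutil

    shutil.make_archive("myModule", "zip",
                        root_dir="/path/to/parent",  # directory that contains myModule/
                        base_dir="myModule")
    # Produces myModule.zip with entries such as myModule/__init__.py, myModule/spark1.py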

  • 2020-12-08 22:07

    Fixed the problem. Admittedly, the solution is not totally Spark-related, but I'm leaving the question posted for the sake of others who may have a similar problem, since the given error message did not make my mistake clear from the start.

    TLDR: Make sure the package contents of the zip file being loaded are structured and named the way your code expects (so each directory should contain an __init__.py).


    The package I was trying to load into the Spark context via zip was of the form

    mypkg
        file1.py
        file2.py
        subpkg1
            file11.py
        subpkg2
            file21.py
    

    My zip, when inspected with less mypkg.zip, showed

    file1.py file2.py subpkg1 subpkg2

    So two things were wrong here.

    1. I was not zipping the top-level directory, which was the main package that the code was expecting to work with.
    2. I was not zipping recursively, so the lower-level directories were not included.

    Solved with zip -r mypkg.zip mypkg (run from the directory containing mypkg).
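
    A quick way to double-check the archive layout (my own sketch, not from the original answer) is to list its entries with Python's zipfile module; every entry should be prefixed with the package name:

    import zipfile

    # Every entry should start with "mypkg/", e.g. mypkg/__init__.py,
    # mypkg/subpkg1/file11.py, and so on.
    with zipfile.ZipFile("mypkg.zip") as zf:
        for name in zf.namelist():
            print(name)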

    More specifically, I had to make two zip files (a quick sanity check on the resulting zips is sketched after the list):

    1. for the dist-keras package:

      cd dist-keras; zip -r distkeras.zip distkeras

    see https://github.com/cerndb/dist-keras/tree/master/distkeras

    2. for the keras package used by distkeras (which is not installed across the cluster):

      cd keras; zip -r keras.zip keras

    see https://github.com/keras-team/keras/tree/master/keras
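
    As a quick local sanity check (my own sketch, not part of the original answer): a zip with the right layout is directly importable once it is on sys.path, via Python's zipimport, which is essentially how a zip shipped with addPyFile() gets imported on the executors:

    import sys

    # A zip archive on sys.path is handled by zipimport, provided the package
    # folder sits at the root of the archive.
    sys.path.insert(0, "distkeras.zip")

    import distkeras  # an ImportError here usually means the zip layout is wrong
    # (the import can still fail for other reasons, e.g. the package's own
    # third-party dependencies not being installed locally)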

    So declaring the Spark session looked like:

    from pyspark import SparkConf

    conf = SparkConf()
    conf.set("spark.app.name", application_name)
    conf.set("spark.master", master)  # master='yarn-client'
    conf.set("spark.executor.cores", str(num_cores))
    conf.set("spark.executor.instances", str(num_executors))
    conf.set("spark.locality.wait", "0")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    # Check if the user is running Spark 2.0+
    if using_spark_2:
        from pyspark.sql import SparkSession

        # Note: despite the name, sc is a SparkSession here, not a SparkContext
        sc = SparkSession.builder.config(conf=conf) \
                .appName(application_name) \
                .getOrCreate()
        sc.sparkContext.addPyFile("/home/me/projects/keras-projects/exploring-keras/keras-dist_test/dist-keras/distkeras.zip")
        sc.sparkContext.addPyFile("/home/me/projects/keras-projects/exploring-keras/keras-dist_test/keras/keras.zip")
        print(sc.version)
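
    To confirm the zips actually reach the worker nodes, a small follow-up check like the one below can help (my own sketch, not part of the original answer; the import inside the mapped function runs on an executor):

    # Force an import on an executor to verify the shipped zip is usable there.
    def executor_import_check(_):
        import distkeras           # resolved from distkeras.zip shipped via addPyFile
        return distkeras.__file__  # on the worker this path points inside the zip

    print(sc.sparkContext.parallelize([0], 1).map(executor_import_check).collect())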
    