PySpark --py-files doesn't work

轻奢々 2020-12-31 01:14

I am using this as the documentation suggests: http://spark.apache.org/docs/1.1.1/submitting-applications.html

Spark version 1.1.0

./spark/bin/spark-submit --py-f         


        
7 Answers
  • 2020-12-31 01:43

    I was facing a similar kind of problem: my worker nodes could not detect the modules even though I was using the --py-files switch.

    There were a couple of things I tried. First I moved the import statement to after the point where the SparkContext (sc) variable is created, hoping the import would happen after the module had been shipped to all nodes, but that still did not work. I then called sc.addFile to add the module from inside the script itself (instead of passing it as a command-line argument) and imported the module's functions afterwards. That did the trick, at least in my case; see the sketch below.
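
    A minimal sketch of that approach, using sc.addPyFile (the .py-specific variant of addFile). The module name my_module.py and its parse function are illustrative, not part of the original answer:

        from pyspark import SparkConf, SparkContext

        conf = SparkConf().setAppName("ship-module-from-script")
        sc = SparkContext(conf=conf)

        # Ship the module from inside the script instead of via --py-files;
        # addPyFile puts it on the executors' import path.
        sc.addPyFile("my_module.py")

        def apply_module(record):
            # Import inside the function so it resolves on the executor,
            # after the file has been shipped.
            import my_module
            return my_module.parse(record)

        parsed = sc.parallelize(["a", "b", "c"]).map(apply_module).collect()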

  • 2020-12-31 01:49

    You need to package your Python code with a tool like setuptools. That lets you create an .egg file, which is similar to a Java JAR file. You can then pass the path of this egg file to --py-files:

    spark-submit --py-files path_to_egg_file path_to_spark_driver_file
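
    A minimal setup.py sketch for producing such an egg (the package name mypackage is illustrative, assuming your code lives in a mypackage/ directory). Build it with python setup.py bdist_egg and pass the file it writes under dist/ to --py-files:

        # setup.py
        from setuptools import setup, find_packages

        setup(
            name="mypackage",
            version="0.1",
            packages=find_packages(),
        )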

  • 2020-12-31 01:52

    Try to import your custom module from inside the method itself rather than at the top of the driver script, e.g.:

    def parse_record(record):
        import parser
        p = parser.parse(record)
        return p
    

    rather than

    import parser
    def parse_record(record):
        p = parser.parse(record)
        return p
    

    CloudPickle doesn't seem to recognise when a custom module has been imported, so it seems to try to pickle the top-level modules along with the other data that's needed to run the method. In my experience, this means that top-level modules appear to exist, but they lack usable members, and nested modules can't be used as expected. Once I either imported with from A import * or imported from inside the method (import A.B), the modules worked as expected.

  • 2020-12-31 01:55

    Try this method of SparkContext:

    sc.addPyFile(path)
    

    According to the PySpark documentation:

    Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

    Try uploading your Python module file to public cloud storage (e.g. AWS S3) and passing the URL to that method, for example:
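
    A short sketch of that variant; the bucket and file names are illustrative:

        from pyspark import SparkConf, SparkContext

        sc = SparkContext(conf=SparkConf().setAppName("remote-dependency"))

        # The path may be local, on HDFS, or an HTTP/HTTPS/FTP URI, as quoted above.
        sc.addPyFile("https://my-bucket.s3.amazonaws.com/deps/my_module.zip")

        # From here on, modules packaged inside my_module.zip can be imported
        # by code that runs on the executors.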

    Here is some more comprehensive reading material: http://www.cloudera.com/documentation/enterprise/5-5-x/topics/spark_python.html

  • 2020-12-31 01:57

    PySpark on EMR is configured for Python 2.6 by default, so make sure your modules aren't being installed for the Python 2.7 interpreter instead. The sketch below is a quick way to check which interpreter the driver and executors actually run.
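
    A quick check, assuming you already have a working SparkContext sc; it prints the Python version used by the driver and the versions reported by the executors:

        import sys

        def executor_python_version(_):
            import sys
            return sys.version

        print("driver   : %s" % sys.version)
        versions = sc.parallelize(list(range(4)), 4).map(executor_python_version).collect()
        print("executors: %s" % set(versions))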

  • 2020-12-31 02:03

    Create a zip file (for example, abc.zip) containing all your dependencies.

    When creating the SparkContext, pass the zip file name:

        sc = SparkContext(conf=conf, pyFiles=["abc.zip"])
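
    One way to build such a zip from Python; the deps/ directory holding your packages is an illustrative name:

        import shutil

        # Archive the contents of deps/ so the packages inside it sit at the
        # root of abc.zip, which is where the executors expect to import from.
        shutil.make_archive("abc", "zip", root_dir="deps")  # writes abc.zip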
    