Use an external library in a PySpark job on a Spark cluster created with Google Dataproc

Asked 2020-12-09 06:36

I have a Spark cluster I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I f

2 Answers
  • 2020-12-09 07:13

    In addition to @Dennis's answer:

    Note that if you need to load multiple external packages, you need to specify a custom escape character like so:

    --properties ^#^spark.jars.packages=org.elasticsearch:elasticsearch-spark_2.10:2.3.2,com.databricks:spark-avro_2.10:2.0.1
    

    Note the ^#^ right before the package list. See gcloud topic escaping for more details.
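
    For context, here is a minimal sketch (my assumption of a Spark 1.x-era job, with hypothetical bucket, host, and index names) of a job that would actually need both of those packages: it reads Avro with spark-avro and writes the result to Elasticsearch with elasticsearch-spark:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext()
    sqlContext = SQLContext(sc)

    # Read Avro via the spark-avro data source pulled in through spark.jars.packages.
    # The GCS path is a placeholder.
    df = sqlContext.read.format("com.databricks.spark.avro").load("gs://my-bucket/events.avro")

    # Write the DataFrame to Elasticsearch via the elasticsearch-spark connector.
    # Host and index/type names are placeholders.
    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "my-es-host:9200")
       .mode("append")
       .save("my_index/my_type"))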

  • 2020-12-09 07:22

    Short Answer

    There are quirks in argument ordering: spark-submit doesn't accept --packages if it comes after the my_job.py argument. To work around this, you can do the following when submitting from Dataproc's CLI:

    gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
        --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
    

    Basically, just add --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 before the .py file in your command.
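
    For completeness, here is a minimal sketch of what my_job.py might look like (assuming the Spark 1.x SQLContext API, which matches the spark-csv 1.2.0 era; the GCS path is a placeholder):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext()
    sqlContext = SQLContext(sc)

    # The com.databricks.spark.csv data source is available only because the
    # package was pulled in via spark.jars.packages at submission time.
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("gs://my-bucket/my-data.csv"))

    df.printSchema()
    print(df.count())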

    Long Answer

    So, this is actually a different issue from the known lack of support for --jars in gcloud beta dataproc jobs submit pyspark; because Dataproc doesn't explicitly recognize --packages as a special spark-submit-level flag, it passes it after the application arguments, so spark-submit treats --packages as an application argument rather than parsing it as a submission-level option. Indeed, in an SSH session, the following does not work:

    # Doesn't work if job.py depends on that package.
    spark-submit job.py --packages com.databricks:spark-csv_2.11:1.2.0
    

    Switching the order of the arguments does make it work, though; and in the pyspark case, both orderings work:

    # Works with dependencies on that package.
    spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py
    pyspark job.py --packages com.databricks:spark-csv_2.11:1.2.0
    pyspark --packages com.databricks:spark-csv_2.11:1.2.0 job.py
    

    So even though spark-submit job.py is supposed to be a drop-in replacement for everything that previously called pyspark job.py, the difference in parse ordering for things like --packages means the migration isn't actually 100% compatible. This might be worth following up on the Spark side.

    Anyhow, fortunately there's a workaround, since --packages is just another alias for the Spark property spark.jars.packages, and Dataproc's CLI supports properties just fine. So you can just do the following:

    gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
        --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
    

    Note that --properties must come before my_job.py, otherwise it gets sent as an application argument rather than as a configuration flag. Hope that works for you! The equivalent in an SSH session would be spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py.
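
    As a quick sanity check (just a sketch, nothing Dataproc-specific), the job itself can confirm that the property actually reached the Spark conf:

    from pyspark import SparkContext

    sc = SparkContext()
    # Should print the Maven coordinates passed via --properties / --packages,
    # e.g. com.databricks:spark-csv_2.11:1.2.0 (empty string if it never arrived).
    print(sc.getConf().get("spark.jars.packages", ""))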
