Databricks (Spark): .egg dependencies not installed automatically?

Submitted by 左心房为你撑大大i on 2019-12-03 17:07:19

Question


I have a locally created .egg package that depends on boto==2.38.0. I used setuptools to create the build distribution. Everything works in my own local environment, as it fetches boto correctly from PyPI. However, on Databricks it does not automatically fetch dependencies when I attach the library to the cluster.

I have struggled for a few days now trying to get the dependency installed automatically when the library is loaded on Databricks. I use setuptools; install_requires=['boto==2.38.0'] is the relevant field (a minimal sketch of the setup.py follows below).
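For reference, this is roughly what my setup.py looks like; the package name and version are placeholders, but the install_requires line is exactly the field described above:

```python
# Minimal setup.py sketch; "mypackage" and the version are placeholders,
# but install_requires matches the field described above.
from setuptools import setup, find_packages

setup(
    name='mypackage',
    version='0.1.0',
    packages=find_packages(),
    # Dependency that I expect Databricks to resolve from PyPI
    # when the resulting .egg is attached to a cluster.
    install_requires=['boto==2.38.0'],
)
```

The egg itself is built with python setup.py bdist_egg.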

When I install boto directly from PyPI on the Databricks cluster (so not relying on the install_requires field at all) and then attach my own .egg, it does recognize that boto is a package, but it does not recognize any of boto's modules (perhaps because they are not imported in my own .egg's namespace?). So I cannot get my .egg to work. If this problem persists without a solution, I would think that is a really big problem for Databricks users right now. There should be a solution, of course...
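For example, after attaching boto 2.38.0 from PyPI to the cluster, this is a sketch of the kind of check I would expect to work in a notebook (S3Connection is just one example submodule):

```python
# Sketch of a check run in a Databricks notebook after attaching
# boto 2.38.0 from PyPI as a separate library; S3Connection is just
# an example submodule.
import boto
from boto.s3.connection import S3Connection

print(boto.__version__)   # expect '2.38.0'
conn = S3Connection()     # picks up AWS credentials from the environment
```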

Thank you!


Answer 1:


Your application's dependencies will not, in general, work properly if they are diverse and do not have uniform Python-version support. The Databricks docs explain that

Databricks will install the correct version if the library supports both Python 2 and 3. If the library does not support Python 3 then library attachment will fail with an error.

In that case, Databricks will not automatically fetch dependencies when you attach a library to the cluster.
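As an illustration only (not taken from the Databricks docs), a dependency commonly advertises which Python versions it supports through trove classifiers in its own setup.py, which is one way that support is made visible on PyPI:

```python
# Illustrative sketch only: how a dependency typically declares Python 2
# and Python 3 support via trove classifiers in its own setup.py.
from setuptools import setup

setup(
    name='somedependency',   # placeholder name
    version='1.0.0',
    classifiers=[
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
    ],
)
```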



Source: https://stackoverflow.com/questions/32119225/databricks-spark-egg-dependencies-not-installed-automatically
