I have a Python project whose folder has the following structure:

main_directory
├── lib
│   └── lib.py
└── run
    └── script.py
To preserve the project structure when submitting a Dataproc job, package your project into a .zip file and pass it via the --py-files parameter when submitting the job:
gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
--py-files libs.zip \
run/script.py
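
Inside run/script.py you can then import the packaged code as a regular module, since Spark adds the archive to sys.path on both the driver and the executors. A minimal sketch of the entry point, assuming lib/ contains an __init__.py so it imports as a package (the helper process_record is hypothetical):

# run/script.py -- minimal sketch; process_record is a hypothetical helper in lib/lib.py
from pyspark.sql import SparkSession

from lib.lib import process_record  # resolvable because libs.zip is on sys.path

spark = SparkSession.builder.appName("example").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3])
print(rdd.map(process_record).collect())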
To create the zip archive, run:
cd main_directory/
zip -r libs.zip . -x "run/script.py"
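
If you prefer to build the archive from Python itself, a rough equivalent using the standard-library zipfile module (run it from main_directory; it skips the entry point and the archive being written):

# build_libs_zip.py -- sketch equivalent of the zip command above
import os
import zipfile

EXCLUDE = {os.path.join("run", "script.py"), "libs.zip"}

with zipfile.ZipFile("libs.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _dirs, files in os.walk("."):
        for name in files:
            path = os.path.normpath(os.path.join(root, name))
            if path in EXCLUDE:
                continue
            zf.write(path)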
Refer to this blog post for more details on how to package dependencies in a zip archive for PySpark jobs.