For example, I have a folder:
/
- test.py
- test.yml
and the job is submitted to the Spark cluster with:
gcloud beta dataproc jobs
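The full submit command looks roughly like this (cluster name is a placeholder, and test.yml is shipped alongside the main script via --files):

gcloud beta dataproc jobs submit pyspark test.py \
    --cluster=<my-cluster> \
    --files=test.yml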
Files distributed using SparkContext.addFile (and --files) can be accessed via SparkFiles. It provides two methods:
- getRootDirectory() - returns the root directory for distributed files
- get(filename) - returns the absolute path to the file

I am not sure if there are any Dataproc specific limitations, but something like this should work just fine:
import logging
from pyspark import SparkFiles

# get() resolves the file name to its local path on the node
with open(SparkFiles.get('test.yml')) as test_file:
    logging.info(test_file.read())
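If you want to double-check what was actually distributed, getRootDirectory() can be used to list that directory; a minimal sketch (nothing Dataproc specific assumed):

import os
from pyspark import SparkFiles

# Directory where files shipped via --files / SparkContext.addFile are placed
root = SparkFiles.getRootDirectory()
print(os.listdir(root))

Just remember that the file has to be passed with --files (or added with SparkContext.addFile) when the job is submitted, otherwise SparkFiles.get('test.yml') will point to a path that does not exist.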