GCP Dataproc custom image Python environment

你离开我真会死。 Submitted on 2021-01-27 05:40:23

Question


I have an issue with a Dataproc custom image and PySpark. My custom image is based on Dataproc 1.4.1-debian9. With my initialization script I install python3 and some packages from a requirements.txt file, then set the python3 env variable to force PySpark to use python3. But when I submit a job to a cluster created with this image (with the single-node flag for simplicity), the job can't find the installed packages. If I log on to a cluster machine and run the pyspark command, it starts the Anaconda PySpark, but if I log on as root and run pyspark, I get PySpark with Python 3.5.3. This is very strange. What I don't understand is: which user is used to create the image? Why do I have a different environment for my user and the root user? I expect the image to be provisioned as root, so I expect all my installed packages to be visible to the root user. Thanks in advance.
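For context, here is a minimal sketch of the kind of customization script described above; the requirements.txt path and exact commands are assumptions for illustration, not the asker's actual script:

#!/bin/bash
set -euxo pipefail

# Install Python 3 from the Debian 9 repositories (Python 3.5 on stretch).
apt-get update
apt-get install -y python3 python3-pip

# Install project dependencies (placeholder path).
pip3 install -r /tmp/requirements.txt

# Force PySpark to use python3 for all users.
echo 'export PYSPARK_PYTHON=python3' > /etc/profile.d/pyspark-python.sh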


Answer 1:


I'd recommend you first read Configure the cluster's Python environment which gives an overview of Dataproc's Python environment on different image versions, as well as instructions on how to install packages and select Python for PySpark jobs.
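For example, the Python used by a PySpark job can be pinned at submit time with standard Spark properties (the cluster and script names below are placeholders):

# Pin both driver and executor Python for a single job.
# my-cluster and my_job.py are placeholders.
gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=my-cluster \
    --properties=spark.pyspark.python=python3,spark.pyspark.driver.python=python3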

In your case, 1.4 already comes with miniconda3. Init actions and jobs are executed as root. /etc/profile.d/effective-python.sh is executed to initialize the Python environment when the cluster is created. But because the custom image script runs first and optional component activation runs afterwards, miniconda3 is not yet initialized at custom image build time, so your script actually customizes the OS's system Python; then, at cluster creation time, miniconda3 initializes a Python that overrides the OS's system Python.

I found a solution: add this code at the beginning of your custom image script, and it will put you in the same Python environment as that of your jobs:

# This is /usr/bin/python
which python 

# Activate miniconda3 optional component.
cat >>/etc/google-dataproc/dataproc.properties <<EOF
dataproc.components.activate=miniconda3
EOF
bash /usr/local/share/google/dataproc/bdutil/components/activate/miniconda3.sh
source /etc/profile.d/effective-python.sh

# Now this is /opt/conda/default/bin/python
which python 

Then you can install packages, e.g.:

conda install <package> -y
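Since which python now resolves to /opt/conda/default/bin/python, pip installs from your requirements file will also land in the environment your jobs use; the requirements.txt path below is a placeholder:

# Install into the miniconda3 environment that PySpark jobs will use.
/opt/conda/default/bin/pip install -r /tmp/requirements.txt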


Source: https://stackoverflow.com/questions/57008478/gcp-dataproc-custom-image-python-environment
