Ray cluster configuration file_mounts section not allowing worker nodes to launch

Submitted by 跟風遠走 on 2019-12-11 15:57:50

Question


I am trying to distribute a small number of files to each node in a Ray cluster on AWS EC2, using the file_mounts block of the cluster configuration file:

file_mounts: { "./": "./run_files" }

The cluster launches with only a master node, onto which the contents of the run_files directory have been copied correctly. However, the two worker nodes that were requested do not launch. If I omit the file_mounts section, the workers do launch. The Ray monitor log reports that the file libtcl.so cannot be found in a matplotlib package directory of the Anaconda3 installation. This file is present at that path on the master node, so it appears that the setup of the worker nodes is not working properly:

$ ray exec ray_conf.yaml  'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'
2019-05-29 19:36:14,019 INFO updater.py:95 -- NodeUpdater: Waiting for IP of i-073950262949fe9a8...
2019-05-29 19:36:14,019 INFO log_timer.py:21 -- NodeUpdater: i-073950262949fe9a8: Got IP [LogTimer=362ms]
2019-05-29 19:36:14,025 INFO updater.py:272 -- NodeUpdater: Running tail -n 100 -f /tmp/ray/session_*/logs/monitor* on 54.175.173.233...
==> /tmp/ray/session_2019-05-29_23-35-49_842129_4407/logs/monitor.err <==
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/monitor.py", line 376, in <module>
    redis_password=args.redis_password)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/monitor.py", line 54, in __init__
    self.load_metrics)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 349, in __init__
    self.reload_config(errors_fatal=True)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 523, in reload_config
    raise e
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 516, in reload_config
    new_config["worker_start_ray_commands"]
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 790, in hash_runtime_conf
    add_content_hashes(local_path)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 778, in add_content_hashes
    add_hash_of_file(fpath)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/autoscaler/autoscaler.py", line 764, in add_hash_of_file
    with open(fpath, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: './anaconda3/pkgs/matplotlib-2.1.0-py36hba5de38_0/lib/libtcl.so'

==> /tmp/ray/session_2019-05-29_23-35-49_842129_4407/logs/monitor.out <==

(Note that this problem follows on from the question "Workers not being launched on EC2 by ray". I have continued in a new question because the source of the error is now more specifically identified.)


Answer 1:


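I think the libtcl.so error message is very misleading. The real problem is that the remote path in file_mounts (the key of each entry) cannot be the home directory on the workers; neither ./ nor ~/ works. It has to be a sub-directory. The traceback appears consistent with this: the autoscaler fails while hashing the contents of the mounted path, and with ./ that path covers the whole home directory, including the Anaconda installation. The following was successful: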

file_mounts: {"~/run_files": "./run_files"}
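To catch this before launching a cluster, the rule above can be turned into a small pre-flight check. This is only a sketch: home_directory_mounts is a hypothetical helper, not part of Ray, and it assumes the file_mounts block has already been loaded into a Python dict of remote path to local path. It only recognises the relative and tilde spellings of the home directory (./, ., ~/, ~); an absolute path such as /home/ubuntu would not be flagged.

```python
import posixpath

# Spellings that mean "the home directory itself" once the path is normalised.
# Assumption: relative remote paths are resolved from the user's home directory.
HOME_ALIASES = {".", "~"}


def home_directory_mounts(file_mounts):
    """Return the remote paths in file_mounts that point at the home directory.

    file_mounts maps remote path -> local path, as in the cluster config.
    Hypothetical helper for illustration; not a Ray API.
    """
    flagged = []
    for remote_path in file_mounts:
        # normpath turns "./" into "." and "~/" into "~"
        if posixpath.normpath(remote_path) in HOME_ALIASES:
            flagged.append(remote_path)
    return flagged


if __name__ == "__main__":
    failing = {"./": "./run_files"}           # config from the question
    working = {"~/run_files": "./run_files"}  # config from the answer
    print(home_directory_mounts(failing))  # ['./']
    print(home_directory_mounts(working))  # []
```

An empty result means that none of the remote paths is the home directory itself, which is the condition that the working configuration satisfies.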


Source: https://stackoverflow.com/questions/56370163/ray-cluster-configuration-file-mounts-section-not-allowing-worker-nodes-to-launc
