Distributed TensorFlow: CreateSession still waiting

Submitted by 喜夏-厌秋 on 2019-11-29 23:44:42

Question


The simple script below is launched with the arguments shown in its header. Its behaviour varies from run to run, but often one of the workers hangs and keeps printing "CreateSession still waiting for some other task" messages. Why does a new MonitoredTrainingSession need the other tasks? And why don't the others wait for it to start?

# #!/bin/bash
# python train.py --job master --task 0 &
# python train.py --job worker --task 0 &
# python train.py --job worker --task 1 &
# python train.py --job worker --task 2 &
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--job', type=str)
parser.add_argument('--task', type=int)
args = parser.parse_args()
hosts = {
    "master": [
        "localhost:2222",
    ],
    "worker": [
        "localhost:2223",
        "localhost:2224",
        "localhost:2225",
    ]
}

nworkers = len(hosts['worker'])
cluster = tf.train.ClusterSpec(hosts)
server = tf.train.Server(cluster, job_name=args.job, task_index=args.task)

# Place the global step on the master task so every worker shares the same variable.
with tf.device('/job:master/task:0'):
    global_step = tf.train.get_or_create_global_step()
    inc_global_step = tf.assign(global_step, global_step + 1)

if args.job == 'worker':
    hooks = [
        tf.train.StopAtStepHook(last_step=4),
    ]
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(args.task == 0),
                                           hooks=hooks) as sess:
        while not sess.should_stop():
            print(args.task, sess.run(inc_global_step))
else:
    server.join()

It could be waiting for the chief to initialize its variables, but it happens to wait for another non-chief worker too. So, does MonitoredTrainingSession synchronise tasks? If it doesn't, are FIFOQueues the only primitive for doing manual synchronisation?
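(For reference, a manual barrier built from a shared FIFOQueue would look roughly like the sketch below; the queue name and op names are illustrative, not part of the script above.)

# Hypothetical manual barrier: every worker enqueues a token when it is done,
# and the chief blocks until it has dequeued one token per worker.
with tf.device('/job:master/task:0'):
    done_queue = tf.FIFOQueue(nworkers, tf.int32, shared_name='done_queue')
signal_done = done_queue.enqueue(1)
wait_for_all = done_queue.dequeue_many(nworkers)

Each worker would run sess.run(signal_done) when it finishes, and the chief would run sess.run(wait_for_all) before shutting down.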


Answer 1:


By default, a distributed TensorFlow session will attempt to connect to all servers named in the tf.train.ClusterSpec, and will block until they respond. This provides a useful barrier that ensures that all workers have become ready to receive computation requests before returning control to the user. This barrier happens before the MonitoredTrainingSession code that waits for the chief to initialize variables.

If you don't want your session to wait on all servers (e.g. just wait on tasks in "/job:ps" and not the other tasks in "/job:worker", which is a common between-graph deployment strategy), the easiest option is to specify a "device filter" when you create your session. The device filter is a whitelist of (partial) device specifications that determines which tasks a tf.Session will contact at startup. For example, the mnist_replica.py test specifies a device filter as part of the tf.ConfigProto that is used to configure the session.
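For example, a minimal sketch against the question's cluster (TF 1.x; the config variable name is mine) would have each worker whitelist only the master job and its own task:

# Only contact the master tasks and this worker's own task at session creation,
# so startup no longer blocks on the other workers.
config = tf.ConfigProto(
    device_filters=['/job:master', f'/job:worker/task:{args.task}'])

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(args.task == 0),
                                       config=config,
                                       hooks=hooks) as sess:
    ...

With that filter in place, the chief still initializes the shared variables on /job:master/task:0, but each worker's session no longer waits for the remaining workers to come up.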



Source: https://stackoverflow.com/questions/46429766/distributed-tensorflow-createsession-still-waiting
