Distributed TensorFlow 1.0 Supervisor gets stuck if logdir is in HDFS
Question

I built the TF 1.0 binary on CentOS 8 for CPU. My distributed training code for MNIST data works fine if the Supervisor’s logdir is on local disk. But if I change the Supervisor’s logdir to HDFS, the code gets stuck at the Supervisor’s initialization:

sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                         logdir='hdfs://cdh-2:8020/tmp/example',
                         global_step=global_step,
                         init_op=init_op)

I attached gdb and captured the C stack trace. The hang seems to be in _wrap_RecursivelyCreateDir():

#0