Checkpoint file not found, restoring evaluation graph

Anonymous (unverified), submitted 2019-12-03 00:44:02

Question:

I have a model that runs in distributed mode for 4000 steps. Every 120 s the accuracies are calculated (as is done in the provided examples). However, at times the latest checkpoint file is not found.

Error:

Couldn't match files for checkpoint gs://path-on-gcs/train/model.ckpt-1485

The checkpoint file is present at the location. A local run for 2000 steps runs perfectly.

# look up the most recently written checkpoint in the training directory
last_checkpoint = tf.train.latest_checkpoint(train_dir(FLAGS.output_path))

I assume that the checkpoint is still being saved and the files have not actually been written yet, so I tried introducing a wait before the accuracies are calculated, as shown below. Although this seemed to work at first, the model still failed later with a similar issue.

saver.save(session, sv.save_path, global_step)
time.sleep(2)  # wait for GCS to be updated
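
If a fixed sleep is not reliable, one alternative on the evaluation side is to poll until a complete checkpoint is actually visible before reading it. This is only a minimal sketch, assuming TF 1.x and that train_dir(FLAGS.output_path) returns the GCS training directory; the helper name wait_for_checkpoint is made up for illustration.

import time

import tensorflow as tf


def wait_for_checkpoint(checkpoint_dir, timeout_secs=300, poll_secs=5):
    """Polls checkpoint_dir until a usable checkpoint is visible (hypothetical helper)."""
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        ckpt = tf.train.latest_checkpoint(checkpoint_dir)
        # checkpoint_exists verifies the checkpoint's data/index files are readable,
        # not just that the "checkpoint" state file mentions them.
        if ckpt and tf.train.checkpoint_exists(ckpt):
            return ckpt
        time.sleep(poll_secs)
    raise RuntimeError('No usable checkpoint in %s after %d s'
                       % (checkpoint_dir, timeout_secs))


# last_checkpoint = wait_for_checkpoint(train_dir(FLAGS.output_path))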

Answer 1:

From your comment I think I understand what is going on. I may be wrong.

The cloud_ml distributed sample https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/mnist/hptuning/trainer/task.py#L426 uses a temporary file by default, so it works locally in /tmp. Once training is complete, it copies the result to gs://, but it does not fix up the checkpoint state file, which still contains references to local model files under /tmp. Essentially, this is a bug.

To avoid this, launch the training process with --write_to_tmp 0, or modify the task.py file directly to disable this option. TensorFlow will then work directly on gs://, and the resulting checkpoint will be consistent. At least it worked for me.

One way to check whether my assumption is correct is to copy the resulting checkpoint state file from gs:// to your local filesystem using gsutil and inspect its contents.
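
You can also inspect the checkpoint state directly from Python. A minimal sketch, assuming TF 1.x with GCS filesystem support; the bucket path is a placeholder taken from the error message above.

import tensorflow as tf

# Placeholder path; substitute your own training directory.
CHECKPOINT_DIR = 'gs://path-on-gcs/train'

# Parses the "checkpoint" state file in the directory and reports which
# model files it points to. If the paths start with /tmp instead of gs://,
# the state file was written against the local temporary copy.
state = tf.train.get_checkpoint_state(CHECKPOINT_DIR)
if state is None:
    print('No checkpoint state file found in', CHECKPOINT_DIR)
else:
    print('model_checkpoint_path:', state.model_checkpoint_path)
    for path in state.all_model_checkpoint_paths:
        print('all_model_checkpoint_paths:', path)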


