Checkpoint file not found, restoring evaluation graph

Anonymous (unverified), submitted 2019-12-03 00:44:02

Question:

I have a model that runs in distributed mode for 4000 steps. Every 120 s the accuracies are calculated (as is done in the provided examples). However, at times the latest checkpoint file is not found.

Error:

Couldn't match files for checkpoint gs://path-on-gcs/train/model.ckpt-1485

The checkpoint file is present at the location. A local run for 2000 steps runs perfectly.

# look up the most recently written checkpoint in the training directory
last_checkpoint = tf.train.latest_checkpoint(train_dir(FLAGS.output_path))

I assume that the checkpoint is still being saved and the files have not actually been written yet, so I tried introducing a wait before the accuracies are calculated, as shown below. Although this seemed to work at first, the model still failed later with a similar issue.

saver.save(session, sv.save_path, global_step)
time.sleep(2)  # wait for GCS to be updated
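
If a fixed sleep is not reliable, one alternative on the evaluation side is to poll until a complete checkpoint is actually visible before reading it. This is only a minimal sketch, assuming TF 1.x and that train_dir(FLAGS.output_path) returns the GCS training directory; the helper name wait_for_checkpoint is made up for illustration.

import time

import tensorflow as tf


def wait_for_checkpoint(checkpoint_dir, timeout_secs=300, poll_secs=5):
    """Polls checkpoint_dir until a usable checkpoint is visible (hypothetical helper)."""
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        ckpt = tf.train.latest_checkpoint(checkpoint_dir)
        # checkpoint_exists verifies the checkpoint's data/index files are readable,
        # not just that the "checkpoint" state file mentions them.
        if ckpt and tf.train.checkpoint_exists(ckpt):
            return ckpt
        time.sleep(poll_secs)
    raise RuntimeError('No usable checkpoint in %s after %d s'
                       % (checkpoint_dir, timeout_secs))


# last_checkpoint = wait_for_checkpoint(train_dir(FLAGS.output_path))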

Answer 1:

From your comment I think I understand what is going on. I may be wrong.

The cloud_ml distributed sample https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/mnist/hptuning/trainer/task.py#L426 uses a temporary file by default, so it works locally in /tmp. Once training is complete, it copies the result to gs://, but it does not fix up the checkpoint state file, which still contains references to local model files under /tmp. Essentially, this is a bug.

To avoid this, launch the training process with --write_to_tmp 0, or modify the task.py file directly to disable this option. TensorFlow will then work directly on gs://, and the resulting checkpoint will be consistent. At least it worked for me.

One way to check whether my assumption is correct is to copy the resulting checkpoint state file from gs:// to your local filesystem using gsutil and inspect its contents.
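
You can also inspect the checkpoint state directly from Python. A minimal sketch, assuming TF 1.x with GCS filesystem support; the bucket path is a placeholder taken from the error message above.

import tensorflow as tf

# Placeholder path; substitute your own training directory.
CHECKPOINT_DIR = 'gs://path-on-gcs/train'

# Parses the "checkpoint" state file in the directory and reports which
# model files it points to. If the paths start with /tmp instead of gs://,
# the state file was written against the local temporary copy.
state = tf.train.get_checkpoint_state(CHECKPOINT_DIR)
if state is None:
    print('No checkpoint state file found in', CHECKPOINT_DIR)
else:
    print('model_checkpoint_path:', state.model_checkpoint_path)
    for path in state.all_model_checkpoint_paths:
        print('all_model_checkpoint_paths:', path)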


