Question
I have a model that runs in distributed mode for 4000 steps. Every 120 s the accuracies are calculated (as is done in the provided examples). However, at times the last checkpoint file is not found.
Error:
Couldn't match files for checkpoint gs://path-on-gcs/train/model.ckpt-1485
The checkpoint file is present at that location, and a local run for 2000 steps runs perfectly. The latest checkpoint is looked up with:
last_checkpoint = tf.train.latest_checkpoint(train_dir(FLAGS.output_path))
I assume that the checkpoint is still in the process of being saved and the files have not actually been written yet. I tried introducing a wait before the accuracies are calculated, as shown below. Although this seemed to work at first, the model still failed with a similar issue.
saver.save(session, sv.save_path, global_step)
time.sleep(2) #wait for gcs to be updated
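A more robust wait than a fixed sleep would be to poll until the checkpoint files are actually visible. This is only a minimal sketch (assuming TF 1.x and that the evaluator reads from the same gs:// train dir); wait_for_checkpoint is a hypothetical helper, not part of the samples:

import time
import tensorflow as tf

def wait_for_checkpoint(checkpoint_dir, timeout_secs=60, poll_secs=2):
    # Keep polling until latest_checkpoint resolves and the data/index
    # files it points to have actually appeared in the train dir.
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        ckpt = tf.train.latest_checkpoint(checkpoint_dir)
        if ckpt and tf.gfile.Glob(ckpt + ".*"):
            return ckpt
        time.sleep(poll_secs)
    return None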
Answer 1:
From your comment I think I understand what is going on, though I may be wrong.
The cloud_ml distributed sample
https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/mnist/hptuning/trainer/task.py#L426
uses a temporary file by default, so it works locally on /tmp. Once training is complete, it copies the result to gs://, but it does not correct the checkpoint file, which still contains references to local model files on /tmp. Basically, this is a bug.
To avoid this, you should launch the training process with --write_to_tmp 0, or modify the task.py file directly to disable this option. TensorFlow will then work directly on gs://, and the resulting checkpoint will therefore be consistent. At least it worked for me.
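For illustration only (this is not the actual task.py logic), saving straight to a gs:// path keeps the checkpoint state file consistent, because the paths it records are the same ones the evaluator will later resolve. The bucket path and variables below are hypothetical:

import tensorflow as tf

output_path = "gs://path-on-gcs/train"  # hypothetical bucket path

# Minimal graph so the Saver has something to write.
global_step = tf.Variable(0, name="global_step", trainable=False)
weights = tf.Variable(tf.zeros([10]), name="weights")
saver = tf.train.Saver()

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    # The "checkpoint" state file written next to model.ckpt-* will then
    # record gs:// paths rather than /tmp paths.
    saver.save(session, output_path + "/model.ckpt",
               global_step=session.run(global_step))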
One way of checking whether my assumptions are correct is to copy the resulting checkpoint file from gs:// to your local filesystem using gsutil and then inspect its contents.
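Equivalently, you can read the checkpoint state file with TensorFlow itself. A small sketch (assuming TF 1.x with a gs://-capable build; the directory is hypothetical) that prints the paths it records:

import tensorflow as tf

checkpoint_dir = "gs://path-on-gcs/train"  # hypothetical train dir

state = tf.train.get_checkpoint_state(checkpoint_dir)
if state is None:
    print("No checkpoint state file found in", checkpoint_dir)
else:
    # If these paths point to /tmp instead of gs://, that matches the
    # "Couldn't match files for checkpoint" error above.
    print("model_checkpoint_path:", state.model_checkpoint_path)
    for path in state.all_model_checkpoint_paths:
        print("earlier checkpoint:", path)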
Source: https://stackoverflow.com/questions/40189116/checkpoint-file-not-found-restoring-evaluation-graph