Question
I have a model that runs in distributed mode for 4000 steps. Every 120 s the accuracies are calculated (as is done in the provided examples). However, at times the last checkpoint file is not found.
Error:
Couldn't match files for checkpoint gs://path-on-gcs/train/model.ckpt-1485
The checkpoint file is present at that location, and a local run for 2000 steps runs perfectly. The latest checkpoint is looked up with:
last_checkpoint = tf.train.latest_checkpoint(train_dir(FLAGS.output_path))
I assume that the checkpoint is still in the process of being saved and the files have not actually been written yet. I tried introducing a wait before the accuracies are calculated, as shown below. Although this seemed to work at first, the model still failed with a similar issue.
saver.save(session, sv.save_path, global_step)
time.sleep(2) #wait for gcs to be updated
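A more robust wait than a fixed sleep would be to poll until the checkpoint files are actually visible. This is only a minimal sketch (assuming TF 1.x and that the evaluator reads from the same gs:// train dir); wait_for_checkpoint is a hypothetical helper, not part of the samples:

import time
import tensorflow as tf

def wait_for_checkpoint(checkpoint_dir, timeout_secs=60, poll_secs=2):
    # Keep polling until latest_checkpoint resolves and the data/index
    # files it points to have actually appeared in the train dir.
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        ckpt = tf.train.latest_checkpoint(checkpoint_dir)
        if ckpt and tf.gfile.Glob(ckpt + ".*"):
            return ckpt
        time.sleep(poll_secs)
    return None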
Answer 1:
From your comment I think I understand what is going on, though I may be wrong.
The cloud_ml distributed sample
https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/mnist/hptuning/trainer/task.py#L426
uses a temporary file by default, so it works locally on /tmp. Once training is complete, it copies the result to gs://, but it does not correct the checkpoint file, which still contains references to local model files on /tmp. Basically, this is a bug.
To avoid this, you should launch the training process with --write_to_tmp 0, or modify the task.py file directly to disable this option. TensorFlow will then work directly on gs://, and the resulting checkpoint will therefore be consistent. At least it worked for me.
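For illustration only (this is not the actual task.py logic), saving straight to a gs:// path keeps the checkpoint state file consistent, because the paths it records are the same ones the evaluator will later resolve. The bucket path and variables below are hypothetical:

import tensorflow as tf

output_path = "gs://path-on-gcs/train"  # hypothetical bucket path

# Minimal graph so the Saver has something to write.
global_step = tf.Variable(0, name="global_step", trainable=False)
weights = tf.Variable(tf.zeros([10]), name="weights")
saver = tf.train.Saver()

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    # The "checkpoint" state file written next to model.ckpt-* will then
    # record gs:// paths rather than /tmp paths.
    saver.save(session, output_path + "/model.ckpt",
               global_step=session.run(global_step))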
One way of checking whether my assumptions are correct is to copy the resulting checkpoint file from gs:// to your local filesystem using gsutil and then inspect its contents.
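Equivalently, you can read the checkpoint state file with TensorFlow itself. A small sketch (assuming TF 1.x with a gs://-capable build; the directory is hypothetical) that prints the paths it records:

import tensorflow as tf

checkpoint_dir = "gs://path-on-gcs/train"  # hypothetical train dir

state = tf.train.get_checkpoint_state(checkpoint_dir)
if state is None:
    print("No checkpoint state file found in", checkpoint_dir)
else:
    # If these paths point to /tmp instead of gs://, that matches the
    # "Couldn't match files for checkpoint" error above.
    print("model_checkpoint_path:", state.model_checkpoint_path)
    for path in state.all_model_checkpoint_paths:
        print("earlier checkpoint:", path)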
Source: https://stackoverflow.com/questions/40189116/checkpoint-file-not-found-restoring-evaluation-graph