ERROR: Couldn't match files for checkpoint gs://obj-detection/train/model.ckpt

让人想犯罪 __ submitted on 2019-12-11 05:24:04

Question


I ran my detection model on Google Cloud ML and got this error while running the evaluation script. I found a link that mentions this issue, but it seems the issue still has not been solved. Does anyone know how to fix this? Any help would be appreciated. Thanks.

ERROR 2018-02-04 12:53:10 -0600 master-replica-0 Couldn't match files for checkpoint gs://obj-detection/train/model.ckpt-0

INFO 2018-02-04 12:53:10 -0600 master-replica-0 No model found in gs://obj-detection/train. Will try again in 300 seconds

INFO 2018-02-04 12:58:10 -0600 master-replica-0 Starting evaluation at 2018-02-04-18:58:10

ERROR 2018-02-04 12:58:10 -0600 master-replica-0 Couldn't match files for checkpoint gs://obj-detection/train/model.ckpt-0

INFO 2018-02-04 12:58:10 -0600 master-replica-0 No model found in gs://obj-detection/train. Will try again in 300 seconds

...

Meanwhile, the training job appears to be running normally, as its log shows:

... after around 14 hours of running

INFO 2018-02-04 05:09:05 -0600 worker-replica-3 global step 185874: loss = 0.7012 (0.764 sec/step)

INFO 2018-02-04 05:09:05 -0600 worker-replica-4 global step 185873: loss = 0.7749 (0.797 sec/step)

INFO 2018-02-04 05:09:05 -0600 worker-replica-2 global step 185875: loss = 0.4939 (0.775 sec/step)

INFO 2018-02-04 05:09:05 -0600 master-replica-0 global step 185877: loss = 1.1430 (0.850 sec/step)

INFO 2018-02-04 05:09:05 -0600 worker-replica-1 global step 185878: loss = 0.8231 (0.777 sec/step)

INFO 2018-02-04 05:09:05 -0600 worker-replica-0 global step 185881: loss = 0.6470 (0.779 sec/step)


Answer 1:


A few things to check:

  1. Is the training code set up to actually export checkpoints? If you're using an Estimator, this generally works, assuming you're using the standard methods for running the Estimator (e.g., in TF >=1.4, tf.estimator.train_and_evaluate).
  2. Are you passing the correct output directory to the code that is saving checkpoints? For instance, could the training code be writing the checkpoint to a local (temporary?) directory instead of GCS? Could it be saving the checkpoints to a different directory on GCS? A quick scan of the code plus some well-placed print/logging statements is useful here.
  3. How frequently does the training code export checkpoints? E.g., if it saves only every 10 minutes while the evaluator retries every 300 seconds, then you would expect about 1-2 "No model found" messages for every successful evaluation.
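
For the second point, one quick check is to list the training directory (for example with `gsutil ls gs://obj-detection/train`) and see whether any checkpoint there is actually complete. The sketch below is only an illustration: it assumes TensorFlow's usual checkpoint file naming (`model.ckpt-N.index`, `model.ckpt-N.data-XXXXX-of-YYYYY`, `model.ckpt-N.meta`), and the function name `complete_checkpoints` and the sample listing are hypothetical, not taken from the job in the question.

```python
import re
from collections import defaultdict

# Assumed naming: <dir>/model.ckpt-<step>.<index | meta | data-00000-of-00001>
_CKPT_FILE = re.compile(
    r"^(?P<prefix>.*model\.ckpt-\d+)\.(?P<kind>index|meta|data-\d{5}-of-\d{5})$"
)


def complete_checkpoints(paths):
    """Return checkpoint prefixes that have an .index file and at least one
    .data shard, sorted by global step (oldest first)."""
    kinds = defaultdict(set)
    for path in paths:
        match = _CKPT_FILE.match(path.strip())
        if not match:
            continue  # e.g. the "checkpoint" state file or event files
        kind = match.group("kind")
        if kind.startswith("data"):
            kind = "data"  # collapse all shards into one category
        kinds[match.group("prefix")].add(kind)
    complete = [p for p, k in kinds.items() if {"index", "data"} <= k]
    return sorted(complete, key=lambda p: int(p.rsplit("-", 1)[1]))


if __name__ == "__main__":
    # Hypothetical listing, in the shape `gsutil ls` would print it.
    listing = [
        "gs://obj-detection/train/checkpoint",
        "gs://obj-detection/train/model.ckpt-0.meta",
        "gs://obj-detection/train/model.ckpt-185874.index",
        "gs://obj-detection/train/model.ckpt-185874.data-00000-of-00001",
        "gs://obj-detection/train/model.ckpt-185874.meta",
    ]
    print(complete_checkpoints(listing))
```

If the function returns an empty list for the directory the evaluator is watching, or if the prefix named in the error (here `model.ckpt-0`) is not among the complete ones, the training job is probably writing its checkpoints somewhere else or has not finished writing them yet.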


Source: https://stackoverflow.com/questions/48612129/error-couldnt-match-files-for-checkpoint-gs-obj-detection-train-model-ckpt
