How to Pause / Resume Training in Tensorflow


鱼传尺愫 2021-01-05 03:15

This question was asked before the documentation for save and restore was available. I would now consider it deprecated and advise people to rely on the official

3 answers

我在风中等你 (original poster)

2021-01-05 04:06
Using tf.train.MonitoredTrainingSession() helped me resume training after my machine restarted.

Things to keep in mind:
1. Make sure you are saving your checkpoints. In tf.train.Saver() you can set max_to_keep to control how many recent checkpoints are kept.
2. Specify the checkpoint directory in tf.train.MonitoredTrainingSession(checkpoint_dir='dir_path', save_checkpoint_secs=). Based on the save_checkpoint_secs argument, the session keeps saving and updating the checkpoints at that interval.
3. As long as checkpoints keep being saved, the session looks for the latest checkpoint in that directory on startup and resumes training from there.
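
The three points above can be put together roughly like this. It is only a minimal sketch, not the setup I actually ran: the toy model, the train() helper, the 60 second save interval and max_to_keep=3 are all made up for illustration. It assumes TensorFlow 2.x, where the TF1-style API from this answer is available as tf.compat.v1.

```python
import tempfile

import tensorflow as tf

tf1 = tf.compat.v1  # TF1-style API; in TF 1.x these live directly under tf.train


def train(checkpoint_dir, last_step):
    """Train a toy model until global_step == last_step.

    Returns the global step the session started from (0 on a fresh run,
    the restored step when a checkpoint already exists in checkpoint_dir).
    """
    with tf.Graph().as_default():
        global_step = tf1.train.get_or_create_global_step()

        # Hypothetical toy model: a single weight pulled towards 3.0.
        w = tf1.get_variable("w", shape=[], initializer=tf1.zeros_initializer())
        loss = tf.square(w - 3.0)
        train_op = tf1.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)

        # Point 1: a Saver that keeps only the 3 most recent checkpoints.
        scaffold = tf1.train.Scaffold(saver=tf1.train.Saver(max_to_keep=3))
        hooks = [tf1.train.StopAtStepHook(last_step=last_step)]

        # Points 2 and 3: the session saves into checkpoint_dir every
        # save_checkpoint_secs seconds (and once more when it closes), and
        # restores the latest checkpoint from that directory on startup.
        with tf1.train.MonitoredTrainingSession(
                checkpoint_dir=checkpoint_dir,
                save_checkpoint_secs=60,
                scaffold=scaffold,
                hooks=hooks) as sess:
            first_step = int(sess.run(global_step))
            while not sess.should_stop():
                sess.run(train_op)
        return first_step


if __name__ == "__main__":
    ckpt_dir = tempfile.mkdtemp()
    print("started at step", train(ckpt_dir, last_step=5))
    # Simulates restarting the script: the second run picks up the checkpoint.
    print("resumed at step", train(ckpt_dir, last_step=10))
```

Calling train() twice with the same directory stands in for stopping and restarting the script; the second call should continue from the step the first one stopped at instead of starting from zero.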