How to Pause / Resume Training in TensorFlow

鱼传尺愫 2021-01-05 03:15

This question was asked before the documentation for save and restore was available. I now consider this question deprecated and would tell people to rely on the official

3 answers
  •  我在风中等你
    2021-01-05 04:06

    Using tf.train.MonitoredTrainingSession() let me resume training after my machine restarted.

    Things to keep in mind:

    1. Make sure you are saving checkpoints. In tf.train.Saver() you can set max_to_keep to limit how many checkpoints are kept.
    2. Pass the checkpoint directory to the session: tf.train.MonitoredTrainingSession(checkpoint_dir='dir_path', save_checkpoint_secs=). The session saves a new checkpoint at the interval given by save_checkpoint_secs.
    3. When a session is created with the same checkpoint directory, it looks for the latest checkpoint there and resumes training from it.
