How to train a model for the remaining epochs after a long-running session has ended in Google Colaboratory?

Posted by 血红的双手。 on 2019-12-11 05:38:49

Question


I am using Google Colab to train my 3D convolutional neural network for 60 epochs, but when it reaches epoch 57 my session ends. After I reconnect, training starts again from epoch 1.

What should I do to train my model for the remaining epochs after my Google Colaboratory session has ended?


Answer 1:


The FAQ for Colaboratory includes these statements:

  1. What is Colaboratory? Colaboratory is a research tool for machine learning education and research.
  2. Colaboratory is intended for interactive use. Long-running background computations, particularly on GPUs, may be stopped. ... We encourage users who wish to run continuous or long-running computations through Colaboratory’s UI to use a local runtime.

Training an ML model typically requires long-running computations, so the options I am considering are:

  1. Use a local runtime as suggested. This could be a Cloud VM or your laptop.
  2. Use Cloud Datalab: you control (and pay for) the VM resources in Google Cloud.
  3. Checkpoint each epoch and save the checkpoint and weights to persistent storage. See the I/O cookbook. Then restart the training from the checkpoint if the Notebook is reset.

Option 3 might be the easiest to get going, given that your training almost completes on Colaboratory. It depends on which libraries you are using and whether they support checkpointing.

If you are running a bigger computation, look at using a local runtime or Datalab.
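Option 3 can be sketched independently of any particular library. The example below is a minimal illustration, not the asker's actual training code: `train_one_epoch`, the JSON checkpoint format, and the `disconnect_after` parameter (which only simulates a session ending) are all hypothetical stand-ins. The point it shows is that the last finished epoch is stored next to the weights, so a restarted run continues from the following epoch instead of from epoch 1.

```python
import json
import os
import tempfile


def save_checkpoint(path, epoch, weights):
    """Atomically write the last finished epoch together with the model state."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"epoch": epoch, "weights": weights}, f)
    # Rename only after the write finished, so an interrupted save
    # never replaces a good checkpoint with a partial file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (last_finished_epoch, weights), or (0, None) if nothing was saved."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as f:
        state = json.load(f)
    return state["epoch"], state["weights"]


def train_one_epoch(weights):
    """Hypothetical stand-in for one real training epoch."""
    return [w + 1.0 for w in weights]


def train(total_epochs, ckpt_path, disconnect_after=None):
    """Train up to total_epochs, resuming from ckpt_path when it exists.

    disconnect_after is an assumption of this sketch: it simulates the
    session ending right after that epoch has been checkpointed.
    """
    last_epoch, weights = load_checkpoint(ckpt_path)
    if weights is None:
        weights = [0.0, 0.0]  # fresh start
    for epoch in range(last_epoch + 1, total_epochs + 1):
        weights = train_one_epoch(weights)
        save_checkpoint(ckpt_path, epoch, weights)
        last_epoch = epoch
        if epoch == disconnect_after:
            break  # simulated session end
    return last_epoch, weights
```

Calling `train(60, path, disconnect_after=57)` and then `train(60, path)` runs epochs 1 to 57 in the first call and only epochs 58 to 60 in the second. With a real library the same idea applies through that library's own facilities; in Keras, for example, `model.fit` accepts an `initial_epoch` argument for this purpose.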




Answer 2:


You will have to save checkpoints at some interval. If your epochs run quickly, you can save the model every 5-10 epochs; otherwise, save it after each epoch. You also need code that re-reads the latest checkpoint (based on some naming convention). There are a few problems with this, though:

  1. Since Colaboratory is free, you do not get a dedicated GPU instance, so it can disconnect at any time: when you refresh or close the browser, lose your internet connection, and so on.
  2. When that happens, the temporary storage allocated to you goes with it.
  3. Colaboratory also gives you only a limited amount of storage for your data and model.

So you need to save your checkpoints to some persistent storage. Colaboratory supports Google Drive, so check how to save your files there and how to read them back.
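As a rough sketch of these two pieces, the code below mounts Google Drive when it runs inside Colab and looks up the newest checkpoint by file name. The `ckpt_epoch_0057.h5` naming convention, the `colab_checkpoints` folder name, the local fallback folder, and the helper names are all assumptions made for illustration; the `/content/drive/MyDrive` path is where Colab currently exposes a mounted Drive and may differ in older setups.

```python
import os
import re

# Hypothetical naming convention: ckpt_epoch_0057.h5
CKPT_PATTERN = re.compile(r"^ckpt_epoch_(\d+)\.h5$")


def get_checkpoint_dir(subdir="colab_checkpoints", fallback="./checkpoints"):
    """Return a persistent folder: Google Drive in Colab, a local folder elsewhere."""
    try:
        from google.colab import drive  # importable only inside Colab
    except ImportError:
        ckpt_dir = fallback
    else:
        drive.mount("/content/drive")  # asks for authorization on first use
        ckpt_dir = os.path.join("/content/drive/MyDrive", subdir)
    os.makedirs(ckpt_dir, exist_ok=True)
    return ckpt_dir


def checkpoint_name(epoch):
    """Build the file name for a given epoch under the assumed convention."""
    return "ckpt_epoch_{:04d}.h5".format(epoch)


def latest_checkpoint(ckpt_dir):
    """Return (epoch, path) of the newest checkpoint, or (0, None) if there is none."""
    best_epoch, best_path = 0, None
    for name in os.listdir(ckpt_dir):
        match = CKPT_PATTERN.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(ckpt_dir, name)
    return best_epoch, best_path
```

After a reconnect, `latest_checkpoint(get_checkpoint_dir())` gives the epoch to resume from and the file to load. Files that do not match the pattern are ignored, so other files can share the folder.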

If you are looking for an alternative, an AWS spot instance can be a reasonable choice (it is paid, though, so student credits help if you can get them). Note that Colab is, in effect, also a spot instance from Google. You can also try www.crestle.com, which costs 3 cents an hour.



Source: https://stackoverflow.com/questions/54402005/how-to-train-model-with-remaining-epochs-after-long-running-session-has-ended-in
