Tensorflow Object Detection Training Killed, Resource starvation?


I met the same problem you did. Actually, the full memory usage is caused by the data_augmentation_options ssd_random_crop, so you can remove this option and set the batch size to 8 or smaller, e.g. 2 or 4. When I set the batch size to 1, I also ran into some problems caused by NaN loss.
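As a rough sketch, the relevant part of the train_config in the pipeline config would then look something like this (the batch size and the remaining augmentation option are only illustrative, not values from the original answer):

train_config: {
  batch_size: 4                    # try 8, 4, or 2
  data_augmentation_options {
    random_horizontal_flip { }     # kept; only ssd_random_crop is removed
  }
  # data_augmentation_options {
  #   ssd_random_crop { }          # removed: this option drives the memory use
  # }
}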

Another thing is that the parameter epsilon should be a very small number, such as 1e-6, according to the "Deep Learning" book. Epsilon is used to avoid a zero denominator, but the default value here is 1, and I don't think setting it to 1 is correct.
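For reference, a sketch of the corresponding RMSProp optimizer block in the pipeline config; every value other than epsilon is just a placeholder in the style of the sample configs, so keep whatever your config already uses:

optimizer {
  rms_prop_optimizer: {
    learning_rate: {
      exponential_decay_learning_rate {
        initial_learning_rate: 0.004   # placeholder
        decay_steps: 800720            # placeholder
        decay_factor: 0.95             # placeholder
      }
    }
    momentum_optimizer_value: 0.9
    decay: 0.9
    epsilon: 0.000001                  # 1e-6, instead of the default of 1
  }
}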

Alright, so after looking into it and trying a few things, the problem ended up being exactly what the dmesg info I posted showed.

Training was taking up more than the 8 GB of memory that I had, so the solution ended up being to use swap space to increase the amount of memory the model had to pull from.
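If you go this route, a swap file can be created with the usual commands (the 8G size is just an example; pick whatever your disk allows, and add an /etc/fstab entry if you want it to persist across reboots):

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile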

This is a problem many people face. There are multiple proposed solutions:

  • Decrease the batch size -- not always relevant, especially if you train on a GPU (which you should).
  • Increase your memory -- either by adding more RAM or by using swap, as you suggested. However, if you use swap, note that it is ~10-100x slower than RAM, so everything could take much longer.
  • Best: decrease the queue sizes -- it was noted that this problem is usually not directly associated with the model but with the config. The default queue sizes are a bit too big, since the models are computationally heavy and do not process examples at a high rate.

I believe the third solution is the best for you since you are running out of CPU memory (RAM). And it doesn't slow down the training, nor does it affect your model.

To quote from the issue, with my comments:

The section in your new config will look like this:

train_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/pet_train.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/pet_label_map.pbtxt"
  queue_capacity: 500     # change this number
  min_after_dequeue: 250  # change this number (strictly less than the above)
}

You can also set these for eval_input_reader. For that one I am using 20 and 10, and for train I use 400 and 200, although I think I could go lower. My training takes less than 8 GB of RAM.
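For example, the eval_input_reader with the 20/10 values mentioned above would look something like this (the paths are placeholders, and the other fields are just the ones typically present in the sample configs):

eval_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/pet_val.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/pet_label_map.pbtxt"
  queue_capacity: 20
  min_after_dequeue: 10
  shuffle: false
  num_readers: 1
}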

I was suffering from this problem for a while. I agree with @Ciprian's set of steps. I followed all of them, and it turned out my situation was similar to @Derek's. My problem was solved by extending the swap space.

Just a few points for people stuck on this problem, since it is difficult to debug its occurrence in the Object Detection API given that the process can get killed for multiple reasons.

  1. Use the following bash command to monitor CPU and swap usage. What you will find is that after a certain number of steps, the swap space gets exhausted, leading to the process getting killed.

watch -n 5 free -m

  2. Monitor the usage of your GPU with the following, just to make sure the GPU is not getting consumed entirely.

nvidia-smi

  3. If you do not see problems in either of the above steps, I suggest you not only decrease queue_capacity and min_after_dequeue, but also set num_readers to 1 in the train_input_reader and eval_input_reader. Along with this, I'd suggest playing with batch_queue_capacity, num_batch_queue_threads and prefetch_queue_capacity to further reduce the load on the CPU, as suggested in this thread. A rough sketch of these settings is shown below.
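The sketch below shows where those fields live in the pipeline config. The queue_capacity/min_after_dequeue values are the 400/200 mentioned earlier; the batch_queue_capacity, num_batch_queue_threads and prefetch_queue_capacity values are arbitrary starting points to experiment with, not recommendations from the API documentation:

train_config: {
  batch_queue_capacity: 100
  num_batch_queue_threads: 4
  prefetch_queue_capacity: 100
  # keep the rest of your existing train_config as it is
}

train_input_reader: {
  num_readers: 1
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/pet_train.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/pet_label_map.pbtxt"
  queue_capacity: 400
  min_after_dequeue: 200
}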