Tensorflow Object Detection Training Killed, Resource starvation?


I met the same problem you did. Actually, the full memory usage is caused by the data_augmentation_options ssd_random_crop, so you can remove this option and set the batch size to 8 or smaller, e.g. 2 or 4. When I set the batch size to 1, I also ran into some problems caused by NaN loss.
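As a rough sketch, the relevant part of the train_config in the pipeline config would then look something like this (the batch size and the remaining augmentation option are only illustrative, not values from the original answer):

train_config: {
  batch_size: 4                    # try 8, 4, or 2
  data_augmentation_options {
    random_horizontal_flip { }     # kept; only ssd_random_crop is removed
  }
  # data_augmentation_options {
  #   ssd_random_crop { }          # removed: this option drives the memory use
  # }
}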

Another thing is that the parameter epsilon should be a very small number, such as 1e-6, according to the "Deep Learning" book. Epsilon is used to avoid a zero denominator, but the default value here is 1, and I don't think setting it to 1 is correct.
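For reference, a sketch of the corresponding RMSProp optimizer block in the pipeline config; every value other than epsilon is just a placeholder in the style of the sample configs, so keep whatever your config already uses:

optimizer {
  rms_prop_optimizer: {
    learning_rate: {
      exponential_decay_learning_rate {
        initial_learning_rate: 0.004   # placeholder
        decay_steps: 800720            # placeholder
        decay_factor: 0.95             # placeholder
      }
    }
    momentum_optimizer_value: 0.9
    decay: 0.9
    epsilon: 0.000001                  # 1e-6, instead of the default of 1
  }
}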

Alright, so after looking into it and trying a few things, the problem ended up being exactly what the dmesg info I posted showed.

Training was taking up more than the 8 GB of memory that I had, so the solution ended up being to use swap space to increase the amount of memory the model had to pull from.
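If you go this route, a swap file can be created with the usual commands (the 8G size is just an example; pick whatever your disk allows, and add an /etc/fstab entry if you want it to persist across reboots):

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile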

This is a problem many people face. There are multiple proposed solutions:

  • Decrease the batch size -- not always relevant, especially if you train on a GPU (which you should).
  • Increase your memory -- either by adding more RAM or by using swap, as you suggested. However, if you use swap, note that it is ~10-100x slower than RAM, so everything could take much longer.
  • Best: decrease the queue sizes -- it was noted that this problem is usually not directly associated with the model but with the config. The default queue sizes are a bit too big, since the models are computationally heavy and do not process examples at a high rate.

I believe the third solution is the best for you since you are running out of CPU memory (RAM). And it doesn't slow down the training, nor does it affect your model.

To quote from the issue, with my comments:

The section in your new config will look like this:

train_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/pet_train.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/pet_label_map.pbtxt"
  queue_capacity: 500     # change this number
  min_after_dequeue: 250  # change this number (strictly less than the above)
}

You can also set these for eval_input_reader. For that one I am using 20 and 10, and for train I use 400 and 200, although I think I could go lower. My training takes less than 8 GB of RAM.
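For example, the eval_input_reader with the 20/10 values mentioned above would look something like this (the paths are placeholders, and the other fields are just the ones typically present in the sample configs):

eval_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/pet_val.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/pet_label_map.pbtxt"
  queue_capacity: 20
  min_after_dequeue: 10
  shuffle: false
  num_readers: 1
}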

I was suffering from this problem for a while. I agree with @Ciprian's set of steps. I followed all of them, and it turned out my situation was similar to @Derek's. My problem was solved by extending the swap space.

Just a few points for people stuck on this problem, since it is difficult to debug its occurrence in the Object Detection API given that the process can get killed for multiple reasons.

  1. Use the following bash command to monitor CPU and swap usage. What you will find is that after a certain number of steps, the swap space gets exhausted, leading to the process getting killed.

watch -n 5 free -m

  2. Monitor the usage of your GPU with the following, just to make sure the GPU is not getting consumed entirely.

nvidia-smi

  3. If you do not see problems in either of the above steps, I suggest you not only decrease queue_capacity and min_after_dequeue, but also set num_readers to 1 in the train_input_reader and eval_input_reader. Along with this, I'd suggest playing with batch_queue_capacity, num_batch_queue_threads and prefetch_queue_capacity to further reduce the load on the CPU, as suggested in this thread. A rough sketch of these settings is shown below.
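The sketch below shows where those fields live in the pipeline config. The queue_capacity/min_after_dequeue values are the 400/200 mentioned earlier; the batch_queue_capacity, num_batch_queue_threads and prefetch_queue_capacity values are arbitrary starting points to experiment with, not recommendations from the API documentation:

train_config: {
  batch_queue_capacity: 100
  num_batch_queue_threads: 4
  prefetch_queue_capacity: 100
  # keep the rest of your existing train_config as it is
}

train_input_reader: {
  num_readers: 1
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/pet_train.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/pet_label_map.pbtxt"
  queue_capacity: 400
  min_after_dequeue: 200
}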