Perform multi-scale training (yolov2)

Submitted by 末鹿安然 on 2019-12-21 06:38:11

Question


I am wondering how the multi-scale training in YOLOv2 works.

In the paper, it is stated that:

The original YOLO uses an input resolution of 448 × 448. With the addition of anchor boxes we changed the resolution to 416 × 416. However, since our model only uses convolutional and pooling layers it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes so we train this into the model. Instead of fixing the input image size we change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320, 352, ..., 608}. Thus the smallest option is 320 × 320 and the largest is 608 × 608. We resize the network to that dimension and continue training.
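The schedule itself is simple; as a rough sketch of what the quoted paragraph describes (the loop details and helper names here are my own illustration, not code from the paper or Darknet):

    import random

    # Multiples of 32 from 320 to 608, as in the quote.
    SIZES = list(range(320, 609, 32))  # [320, 352, ..., 608]

    input_size = 416
    for batch_idx in range(30):
        if batch_idx % 10 == 0:
            # Pick a new resolution every 10 batches.
            input_size = random.choice(SIZES)
            print(f"batch {batch_idx}: training at {input_size}x{input_size}")
        # images = resize_batch(batch, input_size)  # hypothetical helper
        # loss = net(images); loss.backward(); ...  # same weights throughout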

I don't get how a network with only convolutional and pooling layers can accept inputs of different resolutions. In my experience of building neural networks, changing the input resolution changes the number of parameters of the network, that is, the structure of the network changes.

So, how does YOLOv2 change this on the fly?

I read the configuration file for YOLOv2, but all I found there was a random=1 flag...


Answer 1:


If you only have convolutional layers, the number of weights does not change with the spatial (2D) size of the layers (it would change if you resized the number of channels, though).

For example (an imagined network), if you have 224×224×3 input images and a 3×3×64 convolutional layer, you will have 64 different 3×3×3 convolutional filter kernels = 1728 weights. This value does not depend on the size of the image at all, since a kernel is applied at each position of the image independently. This is the most important property of convolution and convolutional layers, the reason why CNNs can go so deep, and why in Faster R-CNN you can simply crop regions out of your feature map.
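A minimal sketch of this, assuming PyTorch (my own illustration, not code from the answer):

    import torch
    import torch.nn as nn

    # One 3x3 convolution from 3 input channels to 64 output channels.
    # bias=False so the parameter count matches the 64 * 3*3*3 = 1728 figure above.
    conv = nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False)
    print(sum(p.numel() for p in conv.parameters()))  # 1728

    # The same layer accepts any spatial resolution; only the output size changes.
    print(conv(torch.randn(1, 3, 224, 224)).shape)    # [1, 64, 224, 224]
    print(conv(torch.randn(1, 3, 448, 448)).shape)    # [1, 64, 448, 448]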

If there were any fully connected layers, it would not work this way, since there a bigger 2D layer dimension would lead to more connections and more weights.
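You can see the mismatch directly; again a hypothetical PyTorch snippet:

    import torch
    import torch.nn as nn

    # A fully connected layer is tied to one fixed flattened input length.
    fc = nn.Linear(7 * 7 * 64, 4096)               # built for a 7x7x64 feature map

    small = torch.randn(1, 64, 7, 7).flatten(1)    # 3136 features
    large = torch.randn(1, 64, 14, 14).flatten(1)  # 12544 features

    fc(small)    # works
    # fc(large)  # RuntimeError: input doesn't match the weight matrix shape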

In YOLOv2 there is one thing that might still look like it doesn't fit. For example, if you double the image size in each dimension, you'll end up with twice the number of features in each dimension right before the final 1×1×N filter: if your grid was 7×7 for the original network size, the resized network might have 14×14. But then you just get 14×14 × B×(5+C) regression results, which is perfectly fine.
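To make that concrete, here is a toy fully convolutional network of my own (a stand-in for Darknet-19, not the real architecture) that downsamples by 32 and ends in a 1×1 detection head; the very same weights produce differently sized grids:

    import torch
    import torch.nn as nn

    B, C = 5, 20  # anchors per cell and classes, as in YOLOv2 on VOC

    # Five conv+pool stages (total stride 32), then a 1x1 conv to B*(5+C) channels.
    layers, in_ch = [], 3
    for out_ch in (16, 32, 64, 128, 256):
        layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        in_ch = out_ch
    layers.append(nn.Conv2d(in_ch, B * (5 + C), kernel_size=1))
    net = nn.Sequential(*layers)

    # Same parameters, different input resolutions -> different grid sizes.
    print(net(torch.randn(1, 3, 224, 224)).shape)  # [1, 125, 7, 7]
    print(net(torch.randn(1, 3, 448, 448)).shape)  # [1, 125, 14, 14]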




Answer 2:


In YOLO, if you are only using convolutional layers, the size of the output grid changes.

For example, for an input size of:

  1. 320x320, output size is 10x10

  2. 608x608, output size is 19x19

You then calculate the loss on these with respect to the ground-truth grid, which is adjusted in the same way.

Thus you can backpropagate the loss without adding any more parameters, as the sketch below illustrates.
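A sketch of that adjustment (the helper names are hypothetical; the total stride of 32 is YOLOv2's):

    STRIDE = 32  # total downsampling factor of the network

    def grid_size(input_size: int) -> int:
        return input_size // STRIDE

    def responsible_cell(cx: float, cy: float, input_size: int):
        """Map a normalized box center (cx, cy in [0, 1]) to its grid cell."""
        s = grid_size(input_size)
        return int(cx * s), int(cy * s)

    print(grid_size(320), grid_size(608))   # 10 19
    print(responsible_cell(0.5, 0.5, 320))  # (5, 5)
    print(responsible_cell(0.5, 0.5, 608))  # (9, 9)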

Refer to the YOLOv1 paper for the loss function:

[Image: loss function from the YOLOv1 paper]

Thus, in theory, you only need to adjust this function, which depends on the grid size but not on any model parameters, and you should be good to go.

Paper Link: https://arxiv.org/pdf/1506.02640.pdf

The author mentions the same in his video explanation.

Time: 14:53

Video Link



Source: https://stackoverflow.com/questions/50005852/perform-multi-scale-training-yolov2
