Receptive field size and object size in deep learning


Question


I can calculate the receptive field sizes for a 500 x 500 input image to VGGNet.

The receptive field sizes are as follows:

Layer Name = conv1, Output size = 500, Stride =   1, RF size =   3
Layer Name = relu1_1, Output size = 500, Stride =   1, RF size =   3
Layer Name = conv1_2, Output size = 500, Stride =   1, RF size =   5
Layer Name = relu1_2, Output size = 500, Stride =   1, RF size =   5
Layer Name = pool1, Output size = 250, Stride =   2, RF size =   6
Layer Name = conv2_1, Output size = 250, Stride =   2, RF size =  10
Layer Name = relu2_1, Output size = 250, Stride =   2, RF size =  10
Layer Name = conv2_2, Output size = 250, Stride =   2, RF size =  14
Layer Name = relu2_2, Output size = 250, Stride =   2, RF size =  14
Layer Name = pool2, Output size = 125, Stride =   4, RF size =  16
Layer Name = conv3_1, Output size = 125, Stride =   4, RF size =  24
Layer Name = relu3_1, Output size = 125, Stride =   4, RF size =  24
Layer Name = conv3_2, Output size = 125, Stride =   4, RF size =  32
Layer Name = relu3_2, Output size = 125, Stride =   4, RF size =  32
Layer Name = conv3_3, Output size = 125, Stride =   4, RF size =  40
Layer Name = relu3_3, Output size = 125, Stride =   4, RF size =  40
Layer Name = pool3, Output size =  62, Stride =   8, RF size =  44
Layer Name = conv4_1, Output size =  62, Stride =   8, RF size =  60
Layer Name = relu4_1, Output size =  62, Stride =   8, RF size =  60
Layer Name = conv4_2, Output size =  62, Stride =   8, RF size =  76
Layer Name = relu4_2, Output size =  62, Stride =   8, RF size =  76
Layer Name = conv4_3, Output size =  62, Stride =   8, RF size =  92
Layer Name = relu4_3, Output size =  62, Stride =   8, RF size =  92
Layer Name = pool4, Output size =  31, Stride =  16, RF size = 100
Layer Name = conv5_1, Output size =  31, Stride =  16, RF size = 132
Layer Name = relu5_1, Output size =  31, Stride =  16, RF size = 132
Layer Name = conv5_2, Output size =  31, Stride =  16, RF size = 164
Layer Name = relu5_2, Output size =  31, Stride =  16, RF size = 164
Layer Name = conv5_3, Output size =  31, Stride =  16, RF size = 196
Layer Name = relu5_3, Output size =  31, Stride =  16, RF size = 196

I am only looking at layers up to conv5_3.

For example, suppose my object size is 150 x 150 and my image size is 500 x 500.

Can I say that the feature maps of the earlier layers, from conv1 up to conv5_1 (RF 132 < 150), carry only partial features of my object, while conv5_2 and conv5_3 (RF 164 and 196) carry features of almost the whole object?

Is this understanding correct?

But at conv5_3 the output size is only 31 x 31, so I can't visualize how it represents the whole object in the image; still, every cell of the conv5_3 feature map corresponds to a 196 x 196 region of the original 500 x 500 image.

Is this understanding correct?


Answer 1:


Theoretically...

Can I say that the feature maps of the earlier layers, from conv1 up to conv5_1, carry only partial features of my object, while conv5_2 and conv5_3 carry features of almost the whole object? Is this understanding correct?

Yes! You even calculated the receptive field yourself (for a CNN, the receptive field of a feature-map cell is the set of input pixels that can theoretically affect that cell's value)!
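
For anyone who wants to reproduce those numbers, here is a minimal sketch (the layer list and variable names are my own) of the standard recurrence: a layer with kernel size k and stride s updates the receptive field as rf = rf + (k - 1) * jump and the cumulative stride as jump = jump * s. ReLU layers change neither value, so they are skipped.

# Minimal sketch: receptive field (RF) and cumulative stride ("jump")
# for the VGG16 conv stack, matching the table in the question.
vgg16 = [
    ("conv1_1", 3, 1), ("conv1_2", 3, 1), ("pool1", 2, 2),
    ("conv2_1", 3, 1), ("conv2_2", 3, 1), ("pool2", 2, 2),
    ("conv3_1", 3, 1), ("conv3_2", 3, 1), ("conv3_3", 3, 1), ("pool3", 2, 2),
    ("conv4_1", 3, 1), ("conv4_2", 3, 1), ("conv4_3", 3, 1), ("pool4", 2, 2),
    ("conv5_1", 3, 1), ("conv5_2", 3, 1), ("conv5_3", 3, 1),
]

rf, jump = 1, 1  # one input pixel sees itself; adjacent cells are 1 px apart
for name, kernel, stride in vgg16:
    rf += (kernel - 1) * jump  # each extra tap widens the RF by the current jump
    jump *= stride             # striding spreads the cells further apart
    print(f"{name:8s} stride = {jump:2d}, RF = {rf:3d}")  # conv5_3: stride 16, RF 196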

But at conv5_3 the output size is only 31 x 31, so I can't visualize how it represents the whole object in the image; still, every cell of the conv5_3 feature map corresponds to a 196 x 196 region of the original 500 x 500 image. Is this understanding correct?

Yes! But don't forget that although the feature map is only 31x31, the cumulative stride of those features is 16. So each cell of the conv5_3 feature map represents a 196x196 region of the image (keep in mind that wherever the "input window" does not fit inside the image, the remainder is treated as black, i.e. filled with zeros), and neighbouring cells correspond to windows spaced 16 pixels apart. So the 31x31 feature map still fully covers the image; the stride between the windows is just huge.
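
To make that concrete, here is a hedged sketch of the mapping from a conv5_3 cell back to image coordinates. The centre position is approximated as index * stride + stride // 2, which ignores small per-layer padding offsets, so the exact windows may be shifted by a few pixels.

# Hedged sketch: map conv5_3 cell (row, col) to the image region it sees.
STRIDE, RF = 16, 196  # conv5_3 values from the table above

def rf_region(row, col):
    cy = row * STRIDE + STRIDE // 2  # approximate window centre (y)
    cx = col * STRIDE + STRIDE // 2  # approximate window centre (x)
    half = RF // 2
    return (cy - half, cy + half), (cx - half, cx + half)

print(rf_region(0, 0))    # ((-90, 106), (-90, 106)) -> mostly zero padding
print(rf_region(15, 15))  # ((150, 346), (150, 346)) -> fully inside the image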


Effectively...

Okay, above we were talking about the theoretical receptive field, that is, the set of pixels in the image that have a probability greater than zero of affecting one cell (or pixel) of the feature map (31 x 31, in this case). In practice, however, it depends heavily on the weights of your convolution kernels.

Take a look at this post about the effective receptive field (ERF) of CNNs (or, if you have plenty of time, go straight to the original paper, Luo et al., 2016).

In theory, stacking more layers increases the receptive field linearly; in practice, however, things aren't as simple as that: not all pixels in the receptive field contribute equally to the output unit's response.

What is even more interesting is that this effective receptive field is dynamic and changes during training. The consequence for backpropagation is that the central pixels of the receptive field have a much larger gradient magnitude than the border pixels.
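
Here is a hedged sketch of how such an ERF measurement can be done (a toy random-weight stack, not the exact setup of the paper): backpropagate a unit gradient from the central output cell and inspect the magnitude of the input gradient.

# Hedged sketch (toy network): measure the ERF of a stack of ten random
# 3x3 convolutions, whose theoretical RF is 1 + 10*2 = 21 pixels.
import torch
import torch.nn as nn

net = nn.Sequential(*[nn.Conv2d(1, 1, 3, padding=1) for _ in range(10)])

x = torch.ones(1, 1, 101, 101, requires_grad=True)
y = net(x)

seed = torch.zeros_like(y)
seed[0, 0, 50, 50] = 1.0   # gradient of 1 at the central output unit only
y.backward(seed)

g = x.grad[0, 0].abs()
# The gradient mass concentrates near the centre (roughly Gaussian);
# pixels at the border of the 21x21 theoretical RF contribute far less.
print("centre:", g[50, 50].item(), "RF border:", g[50, 60].item())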

The figures in the paper that visualize the ERF show that it does not cover the whole theoretical patch at all. So don't be surprised if the ERF of conv5_3 is much smaller than 196x196.


Also...

Apart from the size of the receptive field, which basically says "this cell of the feature map compresses valuable data from this patch of the image", you also need those features to be expressive enough. So take a look at this post, or search for "vgg visualization" on Google, to get some intuition about the expressiveness of the features themselves.
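
If you want to poke at this yourself, here is a hedged sketch (the torchvision weights argument and the conv5_3 layer index are assumptions about the current torchvision API) that collects intermediate VGG16 feature maps for visualization:

# Hedged sketch (torchvision API assumed): collect intermediate VGG16
# feature maps; plotting their channels gives a feel for expressiveness.
import torch
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()

x = torch.randn(1, 3, 500, 500)  # stand-in for a real preprocessed image
feature_maps = {}
with torch.no_grad():
    for idx, layer in enumerate(vgg):
        x = layer(x)
        if isinstance(layer, torch.nn.Conv2d):
            feature_maps[idx] = x  # index 28 is conv5_3 in torchvision's VGG16

# e.g. plot feature_maps[28][0, c] as a 31x31 heatmap for a few channels c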



Source: https://stackoverflow.com/questions/50148376/receptive-feild-size-and-object-size-in-deep-learning
