Understanding the Caffe Convolutional Layer


Question


I successfully compiled Caffe under Ubuntu and started to study how to define and train my own networks. However, I'm having trouble understanding how the convolutional layer produces its output. For example, the second convolutional layer (conv2) of the LeNet MNIST tutorial (tutorial, lenet.prototxt) has 20 input images and 50 output images:

layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"          # input blob: the 20 feature maps from pool1
  top: "conv2"             # output blob: 50 feature maps
  param {
    lr_mult: 1             # learning-rate multiplier for the weights
  }
  param {
    lr_mult: 2             # learning-rate multiplier for the biases
  }
  convolution_param {
    num_output: 50         # number of kernels = number of output feature maps
    kernel_size: 5         # each kernel is 5 x 5 (x 20 input channels)
    stride: 1
    weight_filler {
      type: "xavier"       # Xavier initialization for the weights
    }
    bias_filler {
      type: "constant"     # biases initialized to a constant (0 by default)
    }
  }
}

How are the output images O_0, ..., O_49 calculated? My intuition is that it works like this (I_i: input images, K_j: kernels, B_k: biases, *: convolution operator):

O_0 = I_0 * K_0 + ... + I_19 * K_19 + B_0
O_1 = I_0 * K_20 + ... + I_19 * K_39 + B_1
...
O_49 = I_0 * K_980 + ... + I_19 * K_999 + B_49

Is this assumption correct?


Answer 1:


In a convolution layer, a kernel (also known as a filter) convolves with the input feature map and generates an output feature map. A set of feature maps is known as a blob. Each kernel is a 3D object of size C x H x W, where C is the number of planes or channels, H is the height, and W is the width of the kernel. Usually the kernel is square, i.e., its height and width are the same. The number of channels (the depth) of the kernel must equal that of the input feature map or image. The output of a convolution between an image and a kernel is called a feature map, and subsequent convolutions between feature maps and kernels likewise produce feature maps.

As explained above, a kernel is a 3D array of numbers. As the kernel slides, or convolves, over the input image, each value in the kernel is multiplied with the corresponding pixel value of the image, and all the products are summed to give one value of the output feature map. The kernel then slides by the given stride and repeats the process to produce the next output value. One kernel generates one output plane, i.e., one feature map, so N kernels generate N feature maps. Each kernel has one bias element, which is added to every value of its output feature map; the number of bias elements therefore equals the number of kernels.

For a 3 x 3 x 3 kernel, one value of the output feature map is computed as

O(x, y) = sum over c = 0..2, i = 0..2, j = 0..2 of K(c, i, j) * I(c, p + i, q + j) + B

where the offsets p and q depend on the stride: p = x * stride, q = y * stride.
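As a sanity check, this per-position multiply-accumulate can be written out in a few lines of Python/NumPy. This is a minimal sketch with a hypothetical function name, not Caffe's actual (heavily optimized) implementation:

import numpy as np

def conv_single_kernel(image, kernel, bias, stride=1):
    """Valid convolution of one (C, kH, kW) kernel over a (C, H, W) input."""
    C, H, W = image.shape
    _, kH, kW = kernel.shape
    out_h = (H - kH) // stride + 1
    out_w = (W - kW) // stride + 1
    out = np.empty((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            p, q = x * stride, y * stride   # top-left corner of the receptive field
            # multiply-accumulate over all channels and kernel positions, plus the bias
            out[x, y] = np.sum(image[:, p:p + kH, q:q + kW] * kernel) + bias
    return out

For example, conv_single_kernel(np.random.randn(3, 5, 5), np.random.randn(3, 3, 3), 0.0) returns a (3, 3) output plane.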

An animation on the Stanford wiki of a kernel sliding across an input illustrates this beautifully.

The relation between the input and output feature maps is given as follows.

Input:
C_in = number of channels in the input feature map
H_in = height of the input feature map
W_in = width of the input feature map

Output:
N_out = number of kernels
C_out = number of channels in each kernel (must equal C_in)
H_out = (H_in + 2 x Padding Height - Kernel Height) / Stride Height + 1
W_out = (W_in + 2 x Padding Width - Kernel Width) / Stride Width + 1

All the C_out planes produced by one kernel are merged (accumulated) into a single plane. Thus the output is a set of N_out feature maps, each of size H_out x W_out.
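These shape formulas translate directly into code; here is a minimal helper (Python, with a hypothetical name) that mirrors them:

def conv_output_shape(h_in, w_in, n_kernels, k_h, k_w,
                      pad_h=0, pad_w=0, stride_h=1, stride_w=1):
    """Output blob shape (N_out, H_out, W_out) of a convolution layer."""
    h_out = (h_in + 2 * pad_h - k_h) // stride_h + 1
    w_out = (w_in + 2 * pad_w - k_w) // stride_w + 1
    return (n_kernels, h_out, w_out)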

Taking the AlexNet example:

Layer 1

Input data (RGB image): (3, 227, 227)
conv1 kernel: (96, 3, 11, 11)
conv1 output: (96, 55, 55)

N = 96, C = 3, H = 11, W = 11
Padding Height = 0, Padding Width = 0
Stride Height = 4
Stride Width = 4

Here, each (3 x 11 x 11) kernel convolves with the (3 x 227 x 227) image such that each channel of the kernel convolves with the corresponding channel of the image. You may visualize it as an (11 x 11) mask convolving over a (227 x 227) input feature map to give a (55 x 55) output feature map; each kernel produces 3 such channel-wise results. Thereafter, the corresponding values of the 3 channels are added together to give one (55 x 55) feature map per kernel. With 96 kernels, 96 such feature maps are generated, so a (96 x 55 x 55) output blob is obtained.

(55, 55) = ( (227 + 2 x 0 - 11) / 4 + 1, (227 + 2 x 0 - 11) / 4 + 1 )
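Plugging the conv1 numbers into the conv_output_shape helper sketched above reproduces this:

# AlexNet conv1: 227 x 227 input, 96 kernels of 11 x 11, stride 4, no padding
print(conv_output_shape(227, 227, n_kernels=96, k_h=11, k_w=11,
                        stride_h=4, stride_w=4))   # -> (96, 55, 55)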

ReLU

Pooling

Normalization

Layer 2

Input feature map: (96, 27, 27)
conv2 kernel: (256, 48, 5, 5)
conv2 output: (256, 27, 27)

N = 256, C = 48, H = 5, W = 5
Padding Height = 2, Padding Width = 2
Stride Height = 1
Stride Width = 1

Here, the input feature map has 96 channels but each kernel has only 48. So the input feature map is split into 2 sets of (48 x 27 x 27) feature maps, and the 256 (48 x 5 x 5) kernels are split into 2 sets of 128 kernels. Each input set is then convolved with its own set of 128 kernels. Per kernel, this first gives 48 channel-wise (27 x 27) planes, which are merged (accumulated) into one (27 x 27) feature map, so each set yields 128 (27 x 27) feature maps. The two sets are then concatenated to give the (256 x 27 x 27) output blob.

(27, 27) = ( (27 + 2 x 2 - 5)/1 + 1, (27 + 2 x 2 - 5)/1 + 1 )
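Reusing conv_single_kernel from the sketch earlier, this group = 2 bookkeeping can be written out as a minimal Python sketch (illustrative only; Caffe implements this far more efficiently via its group parameter in convolution_param):

import numpy as np

def grouped_conv(image, kernels, biases, stride=1, pad=0, groups=2):
    """Split input channels and kernels into `groups` independent sets,
    convolve each set, and concatenate the resulting feature maps."""
    if pad:
        image = np.pad(image, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    in_split = image.shape[0] // groups     # input channels per group (48 here)
    k_split = kernels.shape[0] // groups    # kernels per group (128 here)
    outs = []
    for g in range(groups):
        img_g = image[g * in_split:(g + 1) * in_split]   # this group's input set
        for k in range(g * k_split, (g + 1) * k_split):
            outs.append(conv_single_kernel(img_g, kernels[k], biases[k], stride))
    return np.stack(outs)                   # concatenated output blob

# conv2 of AlexNet: 96-channel input, 256 kernels of (48, 5, 5), pad 2, group 2
x = np.random.randn(96, 27, 27)
w = np.random.randn(256, 48, 5, 5)
b = np.zeros(256)
print(grouped_conv(x, w, b, stride=1, pad=2, groups=2).shape)   # (256, 27, 27)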

P.S.: The math behind convolution is the same across all tools, whether Caffe, Keras, TensorFlow, or Torch; only the implementation and its optimizations differ.



Source: https://stackoverflow.com/questions/45999157/understanding-the-caffe-convolutional-layer
