Intuitive understanding of 1D, 2D, and 3D convolutions in convolutional neural networks

臣服心动 2020-11-27 08:33

Can anyone please clearly explain the difference between 1D, 2D, and 3D convolutions in convolutional neural networks (in deep learning) with the use of examples?

4 Answers
  •  伪装坚强ぢ
    2020-11-27 09:39

    Following the answer from @runhani, I am adding a few more details to make the explanation clearer, and I will try to illustrate it with examples from both TF1 and TF2.

    The main additional bits I'm including are:

    • Emphasis on applications
    • Usage of tf.Variable
    • Clearer explanation of inputs/kernels/outputs for 1D/2D/3D convolution
    • The effects of stride/padding

    1D Convolution

    Here's how you might do 1D convolution using TF 1 and TF 2.

    And to be specific, my data has the following shapes:

    • 1D vector - [batch size, width, in channels] (e.g. 2, 5, 1)
    • Kernel - [width, in channels, out channels] (e.g. 5, 1, 4)
    • Output - [batch size, width, out channels] (e.g. 2, 5, 4)

    TF1 example

    import tensorflow as tf
    import numpy as np
    
    # Placeholder for inputs of shape [batch size, width, in channels]
    inp = tf.placeholder(shape=[None, 5, 1], dtype=tf.float32)
    # Kernel of shape [width, in channels, out channels]
    kernel = tf.Variable(tf.initializers.glorot_uniform()([5, 1, 4]), dtype=tf.float32)
    out = tf.nn.conv1d(inp, kernel, stride=1, padding='SAME')
    
    with tf.Session() as sess:
      tf.global_variables_initializer().run()
      # Feed a batch of two 5-element sequences, each with a single channel
      print(sess.run(out, feed_dict={inp: np.array([[[0],[1],[2],[3],[4]],[[5],[4],[3],[2],[1]]])}))
    

    TF2 Example

    import tensorflow as tf
    import numpy as np
    
    # Input of shape [batch size, width, in channels]: two 5-element sequences
    inp = np.array([[[0],[1],[2],[3],[4]],[[5],[4],[3],[2],[1]]]).astype(np.float32)
    # Kernel of shape [width, in channels, out channels]
    kernel = tf.Variable(tf.initializers.glorot_uniform()([5, 1, 4]), dtype=tf.float32)
    out = tf.nn.conv1d(inp, kernel, stride=1, padding='SAME')
    print(out)
    
    

    It's way less work with TF2, as TF2 does not need a Session or explicit variable initialization, for example.

    What might this look like in real life?

    So let's understand what this is doing using a signal-smoothing example. On the left you have the original signal, and on the right you have the outputs of a 1D convolution with 3 output channels.

    What do multiple channels mean?

    Multiple channels are basically multiple feature representations of an input. In this example you have three representations obtained by three different filters. The first channel is the equally-weighted smoothing filter. The second is a filter that weights the middle of the filter more than the boundaries. The final filter does the opposite of the second. So you can see how these different filters bring about different effects.
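
    To make this concrete, here is a small TF2 sketch of three such filters. The exact weights are illustrative assumptions (not the ones used to produce the figure above), but they follow the same pattern: equal weights, middle-heavy, and boundary-heavy.

    import tensorflow as tf
    import numpy as np
    
    # A noisy 1D signal: [batch size, width, in channels] = [1, 10, 1]
    signal = np.array([0., 1., 3., 2., 4., 6., 5., 7., 9., 8.],
                      dtype=np.float32).reshape(1, -1, 1)
    
    # Three 3-tap filters stacked as [width, in channels, out channels] = [3, 1, 3]:
    # output channel 0: equally weighted smoothing (moving average)
    # output channel 1: weights the middle more than the boundaries
    # output channel 2: weights the boundaries more than the middle
    kernel = np.array([[[1/3, 1/4, 3/8]],
                       [[1/3, 1/2, 1/4]],
                       [[1/3, 1/4, 3/8]]], dtype=np.float32)
    
    out = tf.nn.conv1d(signal, kernel, stride=1, padding='SAME')
    print(out.shape)  # (1, 10, 3) -- one smoothed version of the signal per filter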

    Deep learning applications of 1D convolution

    1D convolution has been successfully used for the sentence classification task.
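
    As a rough sketch of the idea (the layer sizes here are illustrative assumptions, not a published architecture): word embeddings give you a [sentence length, embedding size] signal, and Conv1D slides over the word axis to detect n-gram-like features.

    import tensorflow as tf
    
    # Hypothetical sizes: 10,000-word vocabulary, sentences padded to 50 tokens
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=10000, output_dim=64, input_length=50),
        # 128 filters, each spanning 3 consecutive word embeddings (an n-gram detector)
        tf.keras.layers.Conv1D(filters=128, kernel_size=3, activation='relu'),
        tf.keras.layers.GlobalMaxPooling1D(),  # keep each filter's strongest response
        tf.keras.layers.Dense(1, activation='sigmoid'),  # e.g. binary sentiment
    ])
    model.summary()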

    2D Convolution

    Off to 2D convolution. If you are a deep learning person, the chances that you haven't come across 2D convolution are … well, about zero. It is used in CNNs for image classification, object detection, etc., as well as in NLP problems that involve images (e.g. image caption generation).

    Let's try an example. Here I have a convolution kernel with the following filters:

    • Edge detection kernel (3x3 window)
    • Blur kernel (3x3 window)
    • Sharpen kernel (3x3 window)

    And to be specific, my data has the following shapes:

    • Image (black and white) - [batch_size, height, width, 1] (e.g. 1, 340, 371, 1)
    • Kernel (aka filters) - [height, width, in channels, out channels] (e.g. 3, 3, 1, 3)
    • Output (aka feature maps) - [batch_size, height, width, out_channels] (e.g. 1, 340, 371, 3)

    TF1 Example,

    import tensorflow as tf
    import numpy as np
    from PIL import Image
    
    # NOTE: 'image.png' is a placeholder path -- substitute your own image file
    im = np.array(Image.open('image.png').convert('L'))#/255.0
    
    # Kernel of shape [height, width, in channels, out channels] = [3, 3, 1, 3].
    # Per output channel, the weights form: an edge-detection filter (-1s around 8),
    # a blur filter (all 1/9), and a sharpen filter (0/-1 around 5).
    kernel_init = np.array(
        [
         [[[-1, 1.0/9, 0]],[[-1, 1.0/9, -1]],[[-1, 1.0/9, 0]]],
         [[[-1, 1.0/9, -1]],[[8, 1.0/9, 5]],[[-1, 1.0/9, -1]]],
         [[[-1, 1.0/9, 0]],[[-1, 1.0/9, -1]],[[-1, 1.0/9, 0]]]
         ])
    
    image_height, image_width = im.shape  # infer placeholder dimensions from the image
    inp = tf.placeholder(shape=[None, image_height, image_width, 1], dtype=tf.float32)
    kernel = tf.Variable(kernel_init, dtype=tf.float32)
    out = tf.nn.conv2d(inp, kernel, strides=[1,1,1,1], padding='SAME')
    
    with tf.Session() as sess:
      tf.global_variables_initializer().run()
      res = sess.run(out, feed_dict={inp: np.expand_dims(np.expand_dims(im,0),-1)})
    
    

    TF2 Example

    import tensorflow as tf
    import numpy as np
    from PIL import Image
    
    # NOTE: 'image.png' is a placeholder path -- substitute your own image file
    im = np.array(Image.open('image.png').convert('L'))#/255.0
    x = np.expand_dims(np.expand_dims(im,0),-1)
    
    # Same [3, 3, 1, 3] kernel as in the TF1 example:
    # edge-detection / blur / sharpen, one per output channel
    kernel_init = np.array(
        [
         [[[-1, 1.0/9, 0]],[[-1, 1.0/9, -1]],[[-1, 1.0/9, 0]]],
         [[[-1, 1.0/9, -1]],[[8, 1.0/9, 5]],[[-1, 1.0/9, -1]]],
         [[[-1, 1.0/9, 0]],[[-1, 1.0/9, -1]],[[-1, 1.0/9, 0]]]
         ])
    
    kernel = tf.Variable(kernel_init, dtype=tf.float32)
    
    out = tf.nn.conv2d(x, kernel, strides=[1,1,1,1], padding='SAME')
    

    What might this look like in real life?

    Here you can see the output produced by the above code. The first image is the original, and going clockwise you have the outputs of the 1st, 2nd, and 3rd filters.

    What do multiple channels mean?

    In the context of 2D convolution, it is much easier to understand what these multiple channels mean. Say you are doing face recognition. You can think of each filter as representing (this is a very unrealistic simplification, but it gets the point across) an eye, a mouth, a nose, etc. Each feature map would then be a binary representation of whether that feature is present in the image you provided. I don't think I need to stress that, for a face recognition model, these are very valuable features. More information in this article.

    This is an illustration of what I'm trying to articulate.

    Deep learning applications of 2D convolution

    2D convolution is very prevalent in the realm of deep learning.

    CNNs (Convolutional Neural Networks) use the 2D convolution operation for almost all computer vision tasks (e.g. image classification, object detection, video classification).
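
    For instance, a deliberately tiny (and hypothetical) image classifier using the Keras API might look like this: stacked 2D convolutions with pooling, followed by a dense classifier.

    import tensorflow as tf
    
    model = tf.keras.Sequential([
        # Expects [batch, 28, 28, 1] grayscale images, e.g. MNIST-sized
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D((2, 2)),  # halves height and width
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation='softmax'),  # e.g. 10 classes
    ])
    model.summary()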

    3D Convolution

    Now it becomes increasingly difficult to illustrate what's going on as the number of dimensions increases. But with a good understanding of how 1D and 2D convolution work, it's very straightforward to generalize that understanding to 3D convolution. So here goes.

    And to be specific, my data has the following shapes:

    • 3D data (LIDAR) - [batch size, height, width, depth, in channels] (e.g. 1, 200, 200, 200, 1)
    • Kernel - [height, width, depth, in channels, out channels] (e.g. 5, 5, 5, 1, 3)
    • Output - [batch size, height, width, depth, out channels] (e.g. 1, 200, 200, 200, 3)

    TF1 Example

    import tensorflow as tf
    import numpy as np
    
    tf.reset_default_graph()
    
    # Placeholder for volumes of shape [batch size, height, width, depth, in channels]
    inp = tf.placeholder(shape=[None, 200, 200, 200, 1], dtype=tf.float32)
    kernel = tf.Variable(tf.initializers.glorot_uniform()([5,5,5,1,3]), dtype=tf.float32)
    out = tf.nn.conv3d(inp, kernel, strides=[1,1,1,1,1], padding='SAME')
    
    with tf.Session() as sess:
      tf.global_variables_initializer().run()
      res = sess.run(out, feed_dict={inp: np.random.normal(size=(1,200,200,200,1))})
    
    

    TF2 Example

    import tensorflow as tf
    import numpy as np
    
    # Random volume of shape [batch size, height, width, depth, in channels];
    # cast to float32 to match the kernel's dtype
    x = np.random.normal(size=(1,200,200,200,1)).astype(np.float32)
    kernel = tf.Variable(tf.initializers.glorot_uniform()([5,5,5,1,3]), dtype=tf.float32)
    out = tf.nn.conv3d(x, kernel, strides=[1,1,1,1,1], padding='SAME') 
    

    Deep learning applications of 3D convolution

    3D convolution has been used when developing machine learning applications involving LIDAR (Light Detection and Ranging) data, which is three-dimensional in nature.
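
    One common pattern (sketched here under assumed sizes, not any specific published architecture) is to voxelize the point cloud into a dense occupancy grid and run 3D convolutions over it:

    import tensorflow as tf
    import numpy as np
    
    # Hypothetical voxelized LIDAR scan: a 32x32x32 occupancy grid with 1 channel
    voxels = np.random.binomial(1, 0.05, size=(1, 32, 32, 32, 1)).astype(np.float32)
    
    model = tf.keras.Sequential([
        tf.keras.layers.Conv3D(16, (5, 5, 5), activation='relu', padding='same',
                               input_shape=(32, 32, 32, 1)),
        tf.keras.layers.MaxPooling3D((2, 2, 2)),  # halves height/width/depth
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(3, activation='softmax'),  # e.g. 3 object classes
    ])
    print(model(voxels).shape)  # (1, 3)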

    What... more jargon?: Stride and padding

    Alright, you're nearly there. So hold on. Let's see what stride and padding are. They are quite intuitive if you think about them.

    If you take big strides across a corridor, you get to the end faster in fewer steps. But it also means that you observe less of your surroundings than if you walked in small steps. Let's now reinforce our understanding with a pretty picture too! Let's understand these via 2D convolution.

    Understanding stride

    When you use tf.nn.conv2d, for example, you need to set strides as a vector of 4 elements. There's no reason to get intimidated by this. It just contains the strides in the following order.

    • 2D Convolution - [batch stride, height stride, width stride, channel stride]. Here, you just set the batch stride and channel stride to one (I've been implementing deep learning models for 5 years and never had to set them to anything except one). So that leaves you with only 2 strides to set (see the sketch after this list).

    • 3D Convolution - [batch stride, height stride, width stride, depth stride, channel stride]. Here you worry about height/width/depth strides only.
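
    For example, a stride of 2 along height and width halves the spatial dimensions of the output. A quick TF2 sketch:

    import tensorflow as tf
    import numpy as np
    
    x = np.random.normal(size=(1, 8, 8, 1)).astype(np.float32)
    kernel = tf.random.normal((3, 3, 1, 1))
    
    # strides = [batch stride, height stride, width stride, channel stride]
    out = tf.nn.conv2d(x, kernel, strides=[1, 2, 2, 1], padding='SAME')
    print(out.shape)  # (1, 4, 4, 1) -- height and width halved by the stride of 2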

    Understanding padding

    Now, you notice that no matter how small your stride is (i.e. 1), there is an unavoidable dimension reduction happening during convolution without padding (e.g. with a 2-unit-wide kernel, a 4-unit-wide image shrinks to a width of 3). This is undesirable, especially when building deep convolutional neural networks. This is where padding comes to the rescue. The two most commonly used padding types are:

    • SAME - zero-pads the borders so that, at stride 1, the output keeps the input's height and width
    • VALID - no padding; the kernel only visits positions that lie fully inside the input

    Below you can see the difference.
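
    If you want to check this numerically rather than visually, a quick TF2 shape comparison makes the difference obvious:

    import tensorflow as tf
    import numpy as np
    
    x = np.random.normal(size=(1, 4, 4, 1)).astype(np.float32)
    kernel = tf.random.normal((2, 2, 1, 1))
    
    # VALID: the 2x2 kernel fits in 3 positions along each axis of a 4x4 input
    print(tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='VALID').shape)  # (1, 3, 3, 1)
    
    # SAME: zero-padding preserves the input's height and width at stride 1
    print(tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='SAME').shape)   # (1, 4, 4, 1)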

    Final word: If you are very curious, you might be wondering: we just made a big fuss about automatic dimension reduction, and now we're introducing strides that reduce dimensions even further. But the best thing about stride is that you control when, where, and how the dimensions get reduced.
