Ordering of batch normalization and dropout?

小蘑菇 2020-12-12 08:10

The original question was in regard to TensorFlow implementations specifically. However, the answers are for implementations in general. This general answer is also the correct answer for TensorFlow.

9 Answers
  • Usually, just drop the Dropout (when you have BN):

    • "BN eliminates the need for Dropout in some cases, because BN intuitively provides similar regularization benefits to Dropout."
    • "Architectures like ResNet, DenseNet, etc. do not use Dropout."

    For more details, refer to the paper [Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift], as already mentioned by @Haramoz in the comments.

  • 2020-12-12 08:56

    Conv - Activation - DropOut - BatchNorm - Pool --> Test_loss: 0.04261355847120285

    Conv - Activation - DropOut - Pool - BatchNorm --> Test_loss: 0.050065308809280396

    Conv - Activation - BatchNorm - Pool - DropOut --> Test_loss: 0.04911309853196144

    Conv - Activation - BatchNorm - DropOut - Pool --> Test_loss: 0.06809622049331665

    Conv - BatchNorm - Activation - DropOut - Pool --> Test_loss: 0.038886815309524536

    Conv - BatchNorm - Activation - Pool - DropOut --> Test_loss: 0.04126095026731491

    Conv - BatchNorm - DropOut - Activation - Pool --> Test_loss: 0.05142546817660332

    Conv - DropOut - Activation - BatchNorm - Pool --> Test_loss: 0.04827788099646568

    Conv - DropOut - Activation - Pool - BatchNorm --> Test_loss: 0.04722036048769951

    Conv - DropOut - BatchNorm - Activation - Pool --> Test_loss: 0.03238215297460556


    Trained on the MNIST dataset (20 epochs) with 2 convolutional modules (see below), each followed every time by

    # Classifier head appended after the two convolutional modules
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="elu"))
    model.add(layers.Dense(10, activation="softmax"))
    

    The convolutional layers have a kernel size of (3, 3) and default padding, and the activation is elu. The pooling is MaxPooling with a pool size of (2, 2). The loss is categorical_crossentropy and the optimizer is adam.

    The Dropout rates are 0.2 and 0.3 for the first and second module, respectively, and the numbers of feature maps are 32 and 64, respectively.

    Edit: When I dropped the Dropout, as recommended in some answers, the model converged faster but generalized worse than when I used both BatchNorm and Dropout.
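
    For reference, here is a minimal Keras sketch of how the best-scoring ordering above (Conv - DropOut - BatchNorm - Activation - Pool) could be assembled. This is not the author's exact script: the kernel size, filter counts, dropout rates, and elu activation follow the description above, while the input shape and compile settings are assumptions.

    from tensorflow.keras import Sequential, layers

    model = Sequential()

    # Module 1: 32 feature maps, Dropout 0.2
    model.add(layers.Conv2D(32, (3, 3), input_shape=(28, 28, 1)))  # linear conv; activation applied later
    model.add(layers.Dropout(0.2))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("elu"))
    model.add(layers.MaxPooling2D((2, 2)))

    # Module 2: 64 feature maps, Dropout 0.3
    model.add(layers.Conv2D(64, (3, 3)))
    model.add(layers.Dropout(0.3))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("elu"))
    model.add(layers.MaxPooling2D((2, 2)))

    # Classifier head as described in the answer
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="elu"))
    model.add(layers.Dense(10, activation="softmax"))

    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])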

  • 2020-12-12 08:57

    In Ioffe and Szegedy (2015), the authors state that "we would like to ensure that for any parameter values, the network always produces activations with the desired distribution". So the Batch Normalization layer is actually inserted right after a Conv layer/Fully Connected layer, but before feeding into the ReLU (or any other kind of) activation. See this video at around the 53-minute mark for more details.

    As far as dropout goes, I believe it is applied after the activation layer. In the dropout paper (Figure 3b), the dropout factor/probability matrix r(l) for hidden layer l is applied to y(l), where y(l) is the output after applying the activation function f.

    So in summary, the order of using batch normalization and dropout is:

    -> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC ->
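
    A minimal Keras sketch of this ordering for a single fully connected block; the layer sizes and the 0.5 dropout rate are illustrative assumptions, not taken from the answer.

    from tensorflow.keras import Sequential, layers

    model = Sequential([
        layers.Dense(256, use_bias=False, input_shape=(784,)),  # FC; bias is redundant before BatchNorm
        layers.BatchNormalization(),   # normalize the pre-activations (Ioffe & Szegedy 2015)
        layers.Activation("relu"),     # activation after BatchNorm
        layers.Dropout(0.5),           # Dropout applied to the activated outputs
        layers.Dense(10, activation="softmax"),
    ])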

  • 2020-12-12 09:02

    The correct order is: Conv > Normalization > Activation > Dropout > Pooling
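
    A minimal Keras sketch of that ordering, with an assumed filter count, dropout rate, and input shape:

    from tensorflow.keras import Sequential, layers

    block = Sequential([
        layers.Conv2D(32, (3, 3), padding="same", input_shape=(28, 28, 1)),  # Conv
        layers.BatchNormalization(),                                         # Normalization
        layers.Activation("relu"),                                           # Activation
        layers.Dropout(0.25),                                                # Dropout
        layers.MaxPooling2D((2, 2)),                                         # Pooling
    ])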

  • 2020-12-12 09:06

    I read the recommended papers in the answer and comments from https://stackoverflow.com/a/40295999/8625228

    From Ioffe and Szegedy (2015)’s point of view, only use BN in the network structure. Li et al. (2018) give statistical and experimental analyses showing that there is a variance shift when practitioners use Dropout before BN. Thus, Li et al. (2018) recommend applying Dropout after all BN layers.

    From Ioffe and Szegedy (2015)’s point of view, BN is located inside/before the activation function. However, Chen et al. (2019) use an IC layer which combines Dropout and BN, and they recommend using BN after ReLU.

    To be on the safe side, I use only Dropout or only BN in a network.
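
    For illustration, a minimal sketch of how such an IC block could be expressed in Keras, under my reading of Chen et al. (2019): BatchNorm followed by Dropout, placed after the activation and before the next weight layer. The layer sizes and dropout rate are assumptions.

    from tensorflow.keras import Sequential, layers

    def ic_layer(drop_rate=0.1):
        """IC block as I read Chen et al. (2019): BatchNorm followed by Dropout."""
        return [layers.BatchNormalization(), layers.Dropout(drop_rate)]

    model = Sequential([
        layers.Dense(256, activation="relu", input_shape=(784,)),  # weight layer + ReLU
        *ic_layer(0.1),                                            # IC before the next weight layer
        layers.Dense(256, activation="relu"),
        *ic_layer(0.1),
        layers.Dense(10, activation="softmax"),
    ])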

    Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. “Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks.” CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928.

    Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.

    Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. “Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift.” CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.

  • 2020-12-12 09:08

    Conv/FC - BN - Sigmoid/tanh - Dropout. If the activation function is ReLU or similar, the order of normalization and dropout depends on your task.
