The original question was in regard to TensorFlow implementations specifically. However, the answers are for implementations in general, and this general answer also holds for TensorFlow.
Conv - Activation - DropOut - BatchNorm - Pool --> Test_loss: 0.04261355847120285
Conv - Activation - DropOut - Pool - BatchNorm --> Test_loss: 0.050065308809280396
Conv - Activation - BatchNorm - Pool - DropOut --> Test_loss: 0.04911309853196144
Conv - Activation - BatchNorm - DropOut - Pool --> Test_loss: 0.06809622049331665
Conv - BatchNorm - Activation - DropOut - Pool --> Test_loss: 0.038886815309524536
Conv - BatchNorm - Activation - Pool - DropOut --> Test_loss: 0.04126095026731491
Conv - BatchNorm - DropOut - Activation - Pool --> Test_loss: 0.05142546817660332
Conv - DropOut - Activation - BatchNorm - Pool --> Test_loss: 0.04827788099646568
Conv - DropOut - Activation - Pool - BatchNorm --> Test_loss: 0.04722036048769951
Conv - DropOut - BatchNorm - Activation - Pool --> Test_loss: 0.03238215297460556
Trained on the MNIST dataset (20 epochs) with 2 convolutional modules (see below), each time followed by
model.add(layers.Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))
The convolutional layers have a kernel size of (3, 3) and default padding, and the activation is elu. The pooling is MaxPooling with a pool size of (2, 2). The loss is categorical_crossentropy and the optimizer is adam.
The Dropout probability is 0.2 for the first convolutional module and 0.3 for the second; the number of feature maps is 32 and 64, respectively.
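For reference, here is a minimal Keras sketch of the two convolutional modules in the best-performing ordering from the table above (Conv - DropOut - BatchNorm - Activation - Pool); the variable names and the 28x28x1 input shape are my own assumptions, not taken from the experiment code:

from tensorflow.keras import layers, models

model = models.Sequential()

# First convolutional module: Conv - DropOut - BatchNorm - Activation - Pool
model.add(layers.Conv2D(32, (3, 3), input_shape=(28, 28, 1)))  # linear conv; activation applied after BatchNorm
model.add(layers.Dropout(0.2))
model.add(layers.BatchNormalization())
model.add(layers.Activation("elu"))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))

# Second convolutional module: same ordering, more filters and higher dropout
model.add(layers.Conv2D(64, (3, 3)))
model.add(layers.Dropout(0.3))
model.add(layers.BatchNormalization())
model.add(layers.Activation("elu"))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))

# Classification head as shown above
model.add(layers.Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

The other orderings in the table differ only in how the Dropout, BatchNormalization, Activation, and MaxPooling2D lines are permuted within each module.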
Edit: When I dropped the Dropout, as recommended in some answers, the model converged faster but generalized worse than when I used both BatchNorm and Dropout.