Question About Dropout Layer and Batch Normalization Layer in DNN model

Submitted by 廉价感情 on 2021-01-28 06:00:24

Question


I have some questions about the Dropout layer and the BatchNormalization layer. Basically, I have built a simple DNN with Dropout and BatchNormalization layers and trained it; that part works fine.

For example, the simple DNN model looks like this:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(10, activation='relu', input_shape=[11]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(8, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(6, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1, activation='softmax'),  # note: softmax over a single unit always outputs 1.0; 'sigmoid' is the usual choice for one output
])

model.compile(
    optimizer='adam',
    loss='mae',
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=100,
    verbose=0,
)

But now I would like to use the trained model's weights and biases from all layers in my custom prediction model (forget about the other approaches).

import tensorflow as tf

# Predictions for test
test_logits_1 = tf.matmul(tf_test_dataset, weights_1) + biases_1
test_relu_1 = tf.nn.relu(test_logits_1)

test_logits_2 = tf.matmul(test_relu_1, weights_2) + biases_2
test_relu_2 = tf.nn.relu(test_logits_2)

test_logits_3 = tf.matmul(test_relu_2, weights_3) + biases_3
test_relu_3 = tf.nn.relu(test_logits_3)

test_logits_4 = tf.matmul(test_relu_3, weights_4) + biases_4  # fixed: feed the activated output test_relu_3, not the raw logits
test_prediction = tf.nn.softmax(test_logits_4)                # fixed: test_relu_4 was never defined
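
For reference, here is a minimal sketch of one way the weights and biases above could be pulled out of the trained Keras model; the layer indices follow the Sequential definition earlier, and the variable names (weights_1, biases_1, ...) are just the ones used in the snippet above:

# Dense layers sit at indices 0, 3, 6 and 9 in the Sequential model above
# (the Dropout and BatchNormalization layers occupy the indices in between)
weights_1, biases_1 = model.layers[0].get_weights()
weights_2, biases_2 = model.layers[3].get_weights()
weights_3, biases_3 = model.layers[6].get_weights()
weights_4, biases_4 = model.layers[9].get_weights()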

Now the question: do I need to include the dropout layer, the batch normalization layer, and the batch size in the prediction model? If yes, why, and how do I extract all the details of those layers and use them in my custom prediction model?


Answer 1:


@Dr. Snoopy, thanks for pointing out that BatchNormalization has parameters, but to my knowledge they are not the normalized weights (i.e., weights being normalized), based on what I was able to deduce from the docs and a little research.

The docs say the following (quoted below), and based on the description it is clear that the beta and gamma values are trainable variables, which tallies with the output from TensorFlow.

During training (i.e. when using fit() or when calling the layer/model with the argument training=True), the layer normalizes its output using the mean and standard deviation of the current batch of inputs. That is to say, for each channel being normalized, the layer returns (batch - mean(batch)) / (var(batch) + epsilon) * gamma + beta, where:

  • epsilon is small constant (configurable as part of the constructor arguments)
  • gamma is a learned scaling factor (initialized as 1), which can be disabled by passing scale=False to the constructor.
  • beta is a learned offset factor (initialized as 0), which can be disabled by passing center=False to the constructor.

But that is not the end of the story, as the model summary indicates more parameters than beta and gamma account for.

A factor of 4 can be observed here, i.e., the number of parameters in a BatchNormalization layer is 4 times the size of the input dimension the layer operates on.

These additional parameters are the moving_mean and moving_variance values, which can be seen by listing the layer's variables, as in the sketch below.
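
As an illustration (a minimal sketch; the exact variable names printed depend on the Keras version), inspecting the first BatchNormalization layer of the model above, which sits at index 2 and operates on 10 features:

bn = model.layers[2]
for v in bn.variables:
    print(v.name, v.shape, v.trainable)
# gamma           (10,)  True   (trainable)
# beta            (10,)  True   (trainable)
# moving_mean     (10,)  False  (non-trainable)
# moving_variance (10,)  False  (non-trainable)
# -> 4 * 10 = 40 parameters in total, the factor of 4 noted above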

Coming back to the original question and the OP's concern, "What parameters should I worry about?", the parameters needed for inference are the moving_mean, moving_variance, beta, and gamma values.

The way to use these values/parameters is again easily deducible from the docs, which I quote here again:

During inference (i.e. when using evaluate() or predict() or when calling the layer/model with the argument training=False (which is the default), the layer normalizes its output using a moving average of the mean and standard deviation of the batches it has seen during training. That is to say, it returns (batch - self.moving_mean) / (self.moving_var + epsilon) * gamma + beta.

self.moving_mean and self.moving_var are non-trainable variables that are updated each time the layer is called in training mode, as such:

  • moving_mean = moving_mean * momentum + mean(batch) * (1 - momentum)
  • moving_var = moving_var * momentum + var(batch) * (1 - momentum)

As such, the layer will only normalize its inputs during inference after having been trained on data that has similar statistics as the inference data.
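
As a quick worked example (numbers invented for illustration): with the default momentum of 0.99, a stored moving_mean of 0.0, and a batch mean of 0.5, one training step updates the moving mean to 0.0 * 0.99 + 0.5 * 0.01 = 0.005, so the running statistics drift only slowly towards the batch statistics.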

So assuming the moving_mean, moving_variance, beta, and gamma values are available for every BatchNormalization layer, I think the following piece of code needs to be added after the first activation:

# epsilon just avoids division by zero, so the default value should be okay
# (but see the EDIT below: the variance actually belongs under a square root)
test_BN_1 = (test_relu_1 - moving_mean_1) / (moving_var_1 + epsilon_1) * gamma_1 + beta_1

EDIT:

It turns out that the documentation seems to be wrong, but the implementation is right, based on what I could deduce from the source code on GitHub.

If you follow the links below, you'll see that in the call method of the BatchNormalization class (https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py#L1227) the calculation is actually delegated to the Keras backend function batch_normalization (https://github.com/keras-team/keras/blob/35146d00b44ca645fbf4ad0b007faa07632c6f9e/keras/backend.py#L2963). The backend function's docstring agrees with the reference paper and the picture you've posted.

So that means you should divide by the square root of the variance term, i.e. sqrt(moving_var + epsilon), rather than by (moving_var + epsilon) directly.
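
Concretely, the earlier snippet would become something like the following minimal sketch. It assumes the trained model from the question is available; get_weights() on a BatchNormalization layer with the default center=True and scale=True returns the values in the order [gamma, beta, moving_mean, moving_variance]:

import tensorflow as tf

# values of the first BatchNormalization layer (index 2 in the model above)
gamma_1, beta_1, moving_mean_1, moving_var_1 = model.layers[2].get_weights()
epsilon_1 = model.layers[2].epsilon  # defaults to 1e-3

# corrected inference-time normalization: the variance sits under a square root
test_BN_1 = (test_relu_1 - moving_mean_1) / tf.sqrt(moving_var_1 + epsilon_1) * gamma_1 + beta_1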



Source: https://stackoverflow.com/questions/65737437/question-about-dropout-layer-and-batch-normalization-layer-in-dnn-model
