loss calculation over different batch sizes in keras

悲&欢浪女 2020-12-18 05:58

I know that in theory, the loss of a network over a batch is just the sum of all the individual losses. This is reflected in the Keras code for calculating total loss. Relev

2 Answers
  •  生来不讨喜
    2020-12-18 06:26

    The code you have posted concerns multi-output models, where each output may have its own loss and weight. Hence, the loss values of the different output layers are summed together. However, the individual losses are averaged over the batch, as you can see in the losses.py file. For example, this is the code related to the binary cross-entropy loss:

    def binary_crossentropy(y_true, y_pred):
        return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)
    
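    To see this concretely, here is a small shape check (a sketch assuming standalone Keras 2.x with the TensorFlow backend; the tensors below are made up for illustration). The mean over axis=-1 averages over the units of the output layer, so you still get one loss value per sample rather than a batch average:

    import numpy as np
    from keras import backend as K
    from keras.losses import binary_crossentropy

    # a made-up batch of 4 samples, each with 3 output units
    y_true = K.constant(np.random.randint(0, 2, size=(4, 3)).astype('float32'))
    y_pred = K.constant(np.random.uniform(0.01, 0.99, size=(4, 3)).astype('float32'))

    per_sample = binary_crossentropy(y_true, y_pred)  # mean over axis=-1 (the units)
    print(K.int_shape(per_sample))                    # (4,): one loss value per sample
    print(K.int_shape(K.mean(per_sample)))            # (): the scalar batch average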

    Update: Right after adding the second part of this answer (i.e. the loss functions), like the OP I was baffled by the axis=-1 in the definition of the loss function, and I thought to myself that it must be axis=0 to indicate the average over the batch?! Then I realized that all the K.mean() calls used in the definitions of the loss functions are there for the case of an output layer consisting of multiple units. So where is the loss averaged over the batch? I inspected the code to find the answer: to get the loss value for a specific loss function, a function is called that takes the true and predicted labels as well as the sample weights and mask as its inputs:

    weighted_loss = weighted_losses[i]
    # ...
    output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)
    
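    You can reproduce such a call yourself and check that it returns a single scalar. The following is only a sketch: it assumes Keras 2.2.x and imports the private helper that Keras uses to build these wrapped losses (described below); the exact module path differs between versions, and the inputs are made up:

    import numpy as np
    from keras import backend as K
    from keras import losses
    # private helper; its location varies across Keras versions
    from keras.engine.training_utils import weighted_masked_objective

    weighted_loss = weighted_masked_objective(losses.binary_crossentropy)

    y_true = K.constant([[1.], [0.], [1.], [0.]])
    y_pred = K.constant([[0.9], [0.1], [0.6], [0.4]])
    sample_weight = K.constant([1., 1., 2., 0.])  # per-sample weights; the last sample is ignored

    # the result is one scalar: the (weighted) average of the per-sample losses
    print(K.eval(weighted_loss(y_true, y_pred, sample_weight, None)))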

    What is this weighted_losses[i] function? As you may find, it is an element of a list of (augmented) loss functions:

    weighted_losses = [
        weighted_masked_objective(fn) for fn in loss_functions]
    

    fn is actually one of the loss functions defined in the losses.py file, or it may be a user-defined custom loss function. And now, what is this weighted_masked_objective function? It is defined in the training_utils.py file:

    def weighted_masked_objective(fn):
        """Adds support for masking and sample-weighting to an objective function.
        It transforms an objective function `fn(y_true, y_pred)`
        into a sample-weighted, cost-masked objective function
        `fn(y_true, y_pred, weights, mask)`.
        # Arguments
            fn: The objective function to wrap,
                with signature `fn(y_true, y_pred)`.
        # Returns
            A function with signature `fn(y_true, y_pred, weights, mask)`.
        """
        if fn is None:
            return None
    
        def weighted(y_true, y_pred, weights, mask=None):
            """Wrapper function.
            # Arguments
                y_true: `y_true` argument of `fn`.
                y_pred: `y_pred` argument of `fn`.
                weights: Weights tensor.
                mask: Mask tensor.
            # Returns
                Scalar tensor.
            """
            # score_array has ndim >= 2
            score_array = fn(y_true, y_pred)
            if mask is not None:
                # Cast the mask to floatX to avoid float64 upcasting in Theano
                mask = K.cast(mask, K.floatx())
                # mask should have the same shape as score_array
                score_array *= mask
                #  the loss per batch should be proportional
                #  to the number of unmasked samples.
                score_array /= K.mean(mask)
    
            # apply sample weighting
            if weights is not None:
                # reduce score_array to same ndim as weight array
                ndim = K.ndim(score_array)
                weight_ndim = K.ndim(weights)
                score_array = K.mean(score_array,
                                     axis=list(range(weight_ndim, ndim)))
                score_array *= weights
                score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
            return K.mean(score_array)
        return weighted
    

    As you can see, the per-sample loss is first computed in the line score_array = fn(y_true, y_pred), and then at the end the average of the losses is returned, i.e. return K.mean(score_array). So this confirms that the reported losses are the average of the per-sample losses in each batch.
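    Since the question was about different batch sizes, a quick empirical check makes the point (a sketch using only the public standalone Keras 2.x API; the data and model are arbitrary). Because the reported loss is a per-sample average, evaluate() gives, up to floating-point rounding, the same value regardless of batch_size:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    np.random.seed(0)
    x = np.random.rand(32, 5)
    y = np.random.randint(0, 2, size=(32, 1)).astype('float32')

    model = Sequential([Dense(1, activation='sigmoid', input_shape=(5,))])
    model.compile(optimizer='sgd', loss='binary_crossentropy')

    # both calls report (almost) the same average per-sample loss
    print(model.evaluate(x, y, batch_size=4, verbose=0))
    print(model.evaluate(x, y, batch_size=32, verbose=0))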

    Note that K.mean(), when TensorFlow is used as the backend, calls the tf.reduce_mean() function. Now, when K.mean() is called without an axis argument (the default value of the axis argument is None), as it is in the weighted_masked_objective function, the corresponding call to tf.reduce_mean() computes the mean over all the axes and returns a single value. That's why, no matter the shape of the output layer or the loss function used, only a single loss value is used and reported by Keras (and it should be like this, because optimization algorithms need to minimize a scalar value, not a vector or tensor).
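    Here is a minimal illustration of that last point (assuming the TensorFlow backend; the array is arbitrary): K.mean with no axis argument reduces over all axes at once, leaving exactly the scalar the optimizer needs.

    import numpy as np
    from keras import backend as K

    score_array = K.constant(np.arange(12.).reshape(3, 4))
    print(K.int_shape(K.mean(score_array, axis=-1)))  # (3,): one value per sample
    print(K.int_shape(K.mean(score_array)))           # (): the mean over all axes, a single scalar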
