Let's say I have a network with the following params:
1) What gets combined first: (1) the loss values of the classes (for instance, 10 values, one for each class, get combined per pixel) and then all the pixels in the image, or (2) all the pixels in the image for each individual class, and then all the class losses?
2) How exactly are these different pixel combinations happening: where is it being summed and where is it being averaged?
My answer for (1): When training a batch of images, an array consisting of pixel values is trained by calculating the non-linear function, loss and optimizing (updating the weights). The loss is not calculated for each pixel value; rather, it is done for each image.
The pixel values (X_train), weights and bias (b) are used in a sigmoid (for the simplest example of non-linearity) to calculate the predicted y value. This, along with the y_train (a batch at a time) is used to calculate the loss, which is optimized using one of the optimization methods like SGD, momentum, Adam, etc to update the weights and biases.
My answer for (2): During the non-linearity operation, the pixel values (X_train) are combined with the weights (through a dot product) and added to bias to form a predicted target value.
In a batch, there may be training examples belonging to different classes. The corresponding target values (for each class) are compared with the corresponding predicted values to compute the loss. Therefore, it is perfectly fine to sum all the losses.
It really doesn't matter if they belong to one class or multiple classes as long as you compare it with a corresponding target of the correct class. Make sense?
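The forward pass and batch loss described in this answer can be sketched roughly as follows. This is only a minimal illustration in plain Python, assuming a single sigmoid unit and binary crossentropy as the loss; the toy data, weights, bias, and helper names are hypothetical and not taken from the question.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Dot product of inputs and weights, plus bias, passed through a sigmoid."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(z)

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for a single example; eps guards against log(0)."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1.0 - y_true) * math.log(1.0 - y_pred))

# Hypothetical toy batch: two examples with three "pixel" features each.
X_train = [[0.0, 0.5, 1.0], [1.0, 0.2, 0.1]]
y_train = [1.0, 0.0]
w = [0.1, -0.2, 0.3]  # assumed weights
b = 0.05              # assumed bias

y_pred = [predict(x, w, b) for x in X_train]
losses = [binary_crossentropy(t, p) for t, p in zip(y_train, y_pred)]
batch_loss = sum(losses) / len(losses)  # one scalar for the whole batch
print(batch_loss)
```

An optimizer such as SGD would then use the gradient of batch_loss with respect to w and b to update them.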
Although I have already mentioned part of this answer in a related answer, let's inspect the source code step by step, in more detail, to find the answer concretely.
First, let's feedforward(!): there is a call to the weighted_loss function, which takes y_true, y_pred, sample_weight and mask as inputs:
weighted_loss = weighted_losses[i]
# ...
output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)
weighted_loss is actually an element of a list which contains all the (augmented) loss functions passed to the compile method:
weighted_losses = [
    weighted_masked_objective(fn) for fn in loss_functions]
The "augmented" word I mentioned is important here. That's because, as you can see above, the actual loss function is wrapped by another function called weighted_masked_objective which has been defined as follows:
def weighted_masked_objective(fn):
    """Adds support for masking and sample-weighting to an objective function.

    It transforms an objective function `fn(y_true, y_pred)`
    into a sample-weighted, cost-masked objective function
    `fn(y_true, y_pred, weights, mask)`.

    # Arguments
        fn: The objective function to wrap,
            with signature `fn(y_true, y_pred)`.

    # Returns
        A function with signature `fn(y_true, y_pred, weights, mask)`.
    """
    if fn is None:
        return None

    def weighted(y_true, y_pred, weights, mask=None):
        """Wrapper function.

        # Arguments
            y_true: `y_true` argument of `fn`.
            y_pred: `y_pred` argument of `fn`.
            weights: Weights tensor.
            mask: Mask tensor.

        # Returns
            Scalar tensor.
        """
        # score_array has ndim >= 2
        score_array = fn(y_true, y_pred)
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in Theano
            mask = K.cast(mask, K.floatx())
            # mask should have the same shape as score_array
            score_array *= mask
            # the loss per batch should be proportional
            # to the number of unmasked samples.
            score_array /= K.mean(mask)

        # apply sample weighting
        if weights is not None:
            # reduce score_array to same ndim as weight array
            ndim = K.ndim(score_array)
            weight_ndim = K.ndim(weights)
            score_array = K.mean(score_array,
                                 axis=list(range(weight_ndim, ndim)))
            score_array *= weights
            score_array /= K.mean(K.cast(K.not_equal(weights, 0), K.floatx()))
        return K.mean(score_array)
    return weighted
So, there is a nested function, weighted, that actually calls the real loss function fn in the line score_array = fn(y_true, y_pred). Now, to be concrete, in case of the example the OP provided, the fn (i.e. loss function) is binary_crossentropy. Therefore we need to take a look at the definition of binary_crossentropy() in Keras:
def binary_crossentropy(y_true, y_pred):
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)
which, in turn, calls the backend function K.binary_crossentropy(). When TensorFlow is used as the backend, the definition of K.binary_crossentropy() is as follows:
def binary_crossentropy(target, output, from_logits=False):
    """Binary crossentropy between an output tensor and a target tensor.

    # Arguments
        target: A tensor with the same shape as `output`.
        output: A tensor.
        from_logits: Whether `output` is expected to be a logits tensor.
            By default, we consider that `output`
            encodes a probability distribution.

    # Returns
        A tensor.
    """
    # Note: tf.nn.sigmoid_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # transform back to logits
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
        output = tf.log(output / (1 - output))

    return tf.nn.sigmoid_cross_entropy_with_logits(labels=target,
                                                   logits=output)
The tf.nn.sigmoid_cross_entropy_with_logits returns:
A Tensor of the same shape as logits with the componentwise logistic losses.
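To make the element-wise behaviour concrete, here is a minimal NumPy sketch that mirrors the backend steps shown above (clip, convert back to logits, apply the logistic loss). It is a stand-in, not the TensorFlow implementation: the helper names and tensor sizes are hypothetical, and the stable formula max(x, 0) - x * z + log(1 + exp(-|x|)) is the one given in the TensorFlow documentation for this op.

```python
import numpy as np

def logistic_loss_from_logits(labels, logits):
    """Numerically stable element-wise logistic loss:
    max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

def bce_from_probabilities(target, output, eps=1e-7):
    """Clip probabilities, convert them back to logits, apply the loss."""
    output = np.clip(output, eps, 1 - eps)
    logits = np.log(output / (1 - output))
    return logistic_loss_from_logits(target, logits)

rng = np.random.default_rng(0)
shape = (2, 4, 4, 10)  # hypothetical (batch_size, img_dim, img_dim, num_classes)
y_true = rng.integers(0, 2, size=shape).astype(float)
y_pred = rng.uniform(0.01, 0.99, size=shape)

loss = bce_from_probabilities(y_true, y_pred)
# Textbook formula applied directly to the probabilities, for comparison.
direct = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(loss.shape)                 # same shape as y_pred: nothing is reduced yet
print(np.allclose(loss, direct))  # both formulations agree element-wise
```

No summing or averaging happens at this stage; every (sample, pixel, class) entry still has its own loss value.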
Now, let's backpropagate(!): considering the above note, the output shape of K.binary_crossentropy would be the same as y_pred (or y_true). As the OP mentioned, y_true has a shape of (batch_size, img_dim, img_dim, num_classes). Therefore, the K.mean(..., axis=-1) is applied over a tensor of shape (batch_size, img_dim, img_dim, num_classes), which results in an output tensor of shape (batch_size, img_dim, img_dim). So the loss values of all classes are averaged for each pixel in the image. Hence, the shape of score_array in the weighted function mentioned above would be (batch_size, img_dim, img_dim).
There is one more step: the return statement in the weighted function takes the mean again, i.e. return K.mean(score_array). So how does it compute the mean? If you take a look at the definition of the mean backend function, you will find that the axis argument is None by default:
def mean(x, axis=None, keepdims=False):
    """Mean of a tensor, alongside the specified axis.

    # Arguments
        x: A tensor or variable.
        axis: A list of integer. Axes to compute the mean.
        keepdims: A boolean, whether to keep the dimensions or not.
            If `keepdims` is `False`, the rank of the tensor is reduced
            by 1 for each entry in `axis`. If `keepdims` is `True`,
            the reduced dimensions are retained with length 1.

    # Returns
        A tensor with the mean of elements of `x`.
    """
    if x.dtype.base_dtype == tf.bool:
        x = tf.cast(x, floatx())
    return tf.reduce_mean(x, axis, keepdims)
And it calls tf.reduce_mean(), which, given an axis=None argument, takes the mean over all the axes of the input tensor and returns one single value. Therefore, the mean of the whole tensor of shape (batch_size, img_dim, img_dim) is computed, which translates to taking the average over all the labels in the batch and over all their pixels, and is returned as one single scalar value which represents the loss value. Then, this loss value is reported back by Keras and is used for optimization.
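The two reduction steps can be reproduced with NumPy as a rough sketch, assuming no mask and uniform sample weights. The tensor sizes and the random per-element losses are hypothetical. It also compares against option (2) from the question (pixels first, then classes): because every group being averaged has the same size, both orders give the same scalar.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-element losses, shape (batch_size, img_dim, img_dim, num_classes).
per_element = rng.uniform(0.0, 2.0, size=(2, 4, 4, 10))

# Step 1: the loss function averages over the class axis (axis=-1).
score_array = per_element.mean(axis=-1)  # shape (2, 4, 4)

# Step 2: the wrapper averages over everything that is left (axis=None).
loss = score_array.mean()                # a single scalar

# Option (2) from the question: average over samples and pixels per class first,
# then average the per-class values.
per_class = per_element.mean(axis=(0, 1, 2))  # shape (10,)
loss_classes_last = per_class.mean()

print(score_array.shape)
print(np.isclose(loss, loss_classes_last))   # same value either way
print(np.isclose(loss, per_element.mean()))  # equals the mean over all elements
```

With a mask or non-uniform sample weights the groups are no longer weighted equally, so this equivalence should not be assumed in that case.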
Bonus: what if our model has multiple output layers and therefore multiple loss functions are used?
Remember the first piece of code I mentioned in this answer:
weighted_loss = weighted_losses[i]
# ...
output_loss = weighted_loss(y_true, y_pred, sample_weight, mask)
As you can see there is an i variable which is used for indexing the array. You may have guessed correctly: it is actually part of a loop which computes the loss value for each output layer using its designated loss function and then takes the (weighted) sum of all these loss values to compute the total loss:
# Compute total loss.
total_loss = None
with K.name_scope('loss'):
    for i in range(len(self.outputs)):
        if i in skip_target_indices:
            continue
        y_true = self.targets[i]
        y_pred = self.outputs[i]
        weighted_loss = weighted_losses[i]
        sample_weight = sample_weights[i]
        mask = masks[i]
        loss_weight = loss_weights_list[i]
        with K.name_scope(self.output_names[i] + '_loss'):
            output_loss = weighted_loss(y_true, y_pred,
                                        sample_weight, mask)
        if len(self.outputs) > 1:
            self.metrics_tensors.append(output_loss)
            self.metrics_names.append(self.output_names[i] + '_loss')
        if total_loss is None:
            total_loss = loss_weight * output_loss
        else:
            total_loss += loss_weight * output_loss
    if total_loss is None:
        if not self.losses:
            raise ValueError('The model cannot be compiled '
                             'because it has no loss to optimize.')
        else:
            total_loss = 0.

    # Add regularization penalties
    # and other layer-specific losses.
    for loss_tensor in self.losses:
        total_loss += loss_tensor
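Stripped of the Keras bookkeeping, the loop above boils down to a weighted sum of scalars. The following is only a simplified plain-Python sketch of that arithmetic; the function name and the numeric values are hypothetical, and each entry of output_losses stands for the scalar already produced by the wrapped loss of one output.

```python
def combine_losses(output_losses, loss_weights, extra_losses=()):
    """Weighted sum of per-output losses plus any layer-specific losses."""
    total = sum(w * l for w, l in zip(loss_weights, output_losses))
    return total + sum(extra_losses)

# Hypothetical values for a model with two outputs.
output_losses = [0.7, 0.2]   # scalar loss of each output
loss_weights = [1.0, 0.5]    # per-output weights
regularization = [0.01]      # e.g. a weight-decay penalty from a layer

total = combine_losses(output_losses, loss_weights, regularization)
print(total)  # 1.0 * 0.7 + 0.5 * 0.2 + 0.01
```

Note that the per-output losses are summed here, not averaged, so adding outputs or raising a weight increases the total loss that the optimizer sees.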