Let's say I have a network with the following params:
1) What gets combined first: (a) the per-class loss values (for instance, 10 values, one for each class, combined per pixel) and then all the pixels in the image, or (b) all the pixels in the image for each individual class, with the class losses combined afterwards?

2) How exactly do these different pixel combinations happen: where is it summed and where is it averaged?
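To make question (1) concrete: if the per-pixel, per-class losses are simply summed, the two orders give the same total. Here is a minimal NumPy sketch; the shapes and the `losses` tensor are made up purely for illustration, not taken from any particular framework:

```python
import numpy as np

# Hypothetical per-pixel, per-class loss values for one image:
# 10 classes, a 4x4 image (shapes chosen purely for illustration).
losses = np.random.rand(10, 4, 4)  # (classes, height, width)

# Order (a): combine the 10 class losses per pixel, then combine pixels.
per_pixel = losses.sum(axis=0)       # (4, 4) - one value per pixel
total_a = per_pixel.sum()

# Order (b): combine all pixels per class, then combine the class losses.
per_class = losses.sum(axis=(1, 2))  # (10,) - one value per class
total_b = per_class.sum()

# For a pure sum, both orders agree.
assert np.isclose(total_a, total_b)
```

With averaging instead of summing, the two orders still agree: each one divides the same grand total by the same number of pixel-class terms.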
My answer for (1): When training on a batch of images, an array of pixel values is fed through the network's non-linear function, the loss is calculated, and the weights are updated by the optimizer. The loss is not calculated for each pixel value; rather, it is computed for each image.
The pixel values (X_train), the weights (W), and the bias (b) are used in a sigmoid (the simplest example of a non-linearity) to calculate the predicted y value. This, together with y_train (a batch at a time), is used to calculate the loss, which is then minimized with an optimization method such as SGD, momentum, or Adam to update the weights and biases.
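Here is a minimal sketch of that training step, assuming a single-layer model with a sigmoid non-linearity, binary cross-entropy loss, and plain SGD; the names follow the description above, but the shapes and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X_train = rng.random((8, 784))                       # a batch of 8 flattened images
y_train = rng.integers(0, 2, (8, 1)).astype(float)   # binary targets
W = rng.normal(0, 0.01, (784, 1))                    # weights
b = np.zeros((1,))                                   # bias
lr = 0.1                                             # learning rate for plain SGD

# Forward pass: affine combination, then the sigmoid non-linearity.
y_pred = sigmoid(X_train @ W + b)

# Loss for the whole batch (mean binary cross-entropy over examples).
eps = 1e-12
loss = -np.mean(y_train * np.log(y_pred + eps)
                + (1 - y_train) * np.log(1 - y_pred + eps))

# Backward pass and one SGD update (gradient of BCE w.r.t. W and b).
grad_z = (y_pred - y_train) / len(X_train)
W -= lr * (X_train.T @ grad_z)
b -= lr * grad_z.sum(axis=0)
```

Note that the batch loss is a single scalar: the per-example losses are averaged before the update.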
My answer for (2): The pixel values (X_train) are combined with the weights through a dot product and added to the bias; the non-linearity is then applied to this result to form a predicted target value.
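In symbols (using W for the weights, my notation, since the text above only names b explicitly):

$$\hat{y} = \sigma(X_{\text{train}} W + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$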
In a batch, there may be training examples belonging to different classes. Each target value is compared with its corresponding predicted value to compute that example's loss. These per-example losses are independent of one another, so it is perfectly fine to sum all of them.
It really doesn't matter whether they belong to one class or to multiple classes, as long as you compare each prediction with its corresponding target for the correct class. Make sense?
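A small sketch of that last point, assuming a 3-class batch with softmax-style outputs and cross-entropy (the class count and probability values are made up for illustration): each example's loss only ever touches its own target, so the batch loss is just their sum (or mean).

```python
import numpy as np

# Predicted class probabilities for a batch of 4 examples, 3 classes
# (rows already sum to 1, as softmax outputs would).
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1]])
# Targets drawn from *different* classes within the same batch.
y_true = np.array([0, 1, 2, 0])

# Each example's cross-entropy compares its prediction with its own
# target class only; examples never interact with each other's losses.
per_example = -np.log(y_pred[np.arange(len(y_true)), y_true])

# So summing (or averaging) across the mixed-class batch is fine.
batch_loss = per_example.sum()
print(per_example, batch_loss)
```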