I have a network that produces a 4D output tensor where the value at each position in spatial dimensions (~pixel) is to be interpreted as the class probabilities for that po
Just flatten the output to a 2D tensor of size (num_batches, height * width * num_classes). You can do this with the Flatten layer. Ensure that your y is flattened the same way (normally calling y = y.reshape((num_batches, height * width * num_classes)) is enough).
For your second question, using categorical crossentropy over all width*height predictions is essentially the same as averaging the categorical crossentropy for each width*height predictions (by the definition of categorical crossentropy).