In the output layer of a neural network, it is typical to use the softmax function, so that the output activations form a probability distribution:

$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}.$$
Suppose we change the softmax function so that the output activations are given by

$$a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}},$$

where $c$ is a positive constant. Note that $c = 1$ corresponds to the standard softmax function. But if we use a different value of $c$ we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow $c$ to become large, i.e., $c \to \infty$. What is the limiting value for the output activations $a^L_j$? After solving this problem it should be clear why we think of the $c = 1$ function as a "softened" version of the maximum function. This is the origin of the term "softmax". You can follow the details in this source (equation 83).
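As a quick numerical illustration (not part of the original exercise), here is a small Python/NumPy sketch of the modified softmax; the function name `softmax_c` and the example inputs are my own choices. It shows both claims in action: the activations always sum to 1, and as $c$ grows the distribution concentrates on the largest $z^L_j$.

```python
import numpy as np

def softmax_c(z, c=1.0):
    """Modified softmax: a_j = exp(c * z_j) / sum_k exp(c * z_k)."""
    # Subtract the max before exponentiating for numerical stability;
    # this leaves the result unchanged because the shift cancels in the ratio.
    shifted = c * (z - np.max(z))
    e = np.exp(shifted)
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0, 0.5])  # example weighted inputs z^L_j

for c in [1, 5, 50]:
    a = softmax_c(z, c)
    print(f"c={c:>3}: a={np.round(a, 4)}  sum={a.sum():.6f}")
# As c grows, the output approaches a one-hot vector picking out the largest
# z_j (here index 2), while always summing to 1 -- the "softened" maximum
# hardens into an actual maximum.
```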