In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution:
Suppose we change the softmax function so the output activations are given by
where c is a positive constant. Note that c=1 corresponds to the standard softmax function. But if we use a different value of c we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax. Suppose we allow c to become large, i.e., c→∞. What is the limiting value for the output activations a^L_j? After solving this problem it should be clear to you why we think of the c=1 function as a "softened" version of the maximum function. This is the origin of the term "softmax". You can follow the details from this source (equation 83).