Why use softmax as opposed to standard normalization?

一整个雨季 · 2020-12-02 03:43

In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution:

$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}$$
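A minimal NumPy sketch (illustrative, not from the original post) contrasting softmax with naive sum-normalization; the naive version can produce negative "probabilities" when some outputs are negative, and is undefined when the outputs sum to zero:

```python
import numpy as np

def softmax(z):
    """Exponentiate, then normalize; subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def standard_normalization(z):
    """Naive alternative: divide each raw output by the sum of outputs."""
    return z / z.sum()

z = np.array([1.0, 2.0, -1.0])
print(softmax(z))                 # [0.26 0.71 0.04] -- positive, sums to 1
print(standard_normalization(z))  # [0.5  1.0  -0.5] -- a negative "probability"
```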

9 Answers
  •  一生所求 · 2020-12-02 04:30

    Suppose we change the softmax function so that the output activations are given by

    $$a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}$$

    where $c$ is a positive constant. Note that $c=1$ corresponds to the standard softmax function. But if we use a different value of $c$ we get a different function, which is nonetheless qualitatively rather similar to the softmax. In particular, show that the output activations form a probability distribution, just as for the usual softmax.

    Suppose we allow $c$ to become large, i.e., $c \to \infty$. What is the limiting value for the output activations $a^L_j$? After solving this problem it should be clear to you why we think of the $c=1$ function as a "softened" version of the maximum function. This is the origin of the term "softmax". You can follow the details from this source (equation 83).
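    (Answer sketch for the exercise above: divide numerator and denominator by $e^{c \max_k z^L_k}$; as $c \to \infty$ every term except the maximum vanishes, so $a^L_j \to 1$ for the neuron with the largest $z^L_j$ and $a^L_j \to 0$ for all others. The output approaches a one-hot "hard" maximum.) A short NumPy demonstration of this limit; `softmax_c` is an illustrative helper, not from the quoted source:

    ```python
    import numpy as np

    def softmax_c(z, c=1.0):
        """Scaled softmax: a_j = exp(c*z_j) / sum_k exp(c*z_k)."""
        e = np.exp(c * (z - np.max(z)))  # subtract max for numerical stability
        return e / e.sum()

    z = np.array([1.0, 2.0, 3.0])
    for c in (1, 10, 100):
        a = softmax_c(z, c)
        print(c, a.round(4), a.sum())  # each row sums to 1 ...
    # 1   [0.09   0.2447 0.6652] 1.0
    # 100 [0.     0.     1.    ] 1.0  ... and tends to the one-hot argmax as c grows
    ```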
