Why use softmax as opposed to standard normalization?

一整个雨季 2020-12-02 03:43

In the output layer of a neural network, it is typical to use the softmax function to approximate a probability distribution: p_j = exp(z_j) / sum_k exp(z_k).
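To make the comparison concrete, here is a minimal sketch of softmax next to the "standard normalization" the question alludes to (dividing by the sum). The function names are illustrative, not from any particular library; note that plain sum-normalization breaks down as soon as scores can be negative or sum to zero, whereas softmax always yields a valid distribution.

```python
import math

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result,
    # since softmax is invariant to adding a constant to every score.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def standard_normalize(z):
    # Naive alternative: divide each score by the sum of scores.
    # Fails for negative scores (can give "probabilities" outside [0, 1])
    # and is undefined when the scores sum to zero.
    s = sum(z)
    return [v / s for v in z]

scores = [3.0, 1.0, 0.2]
print(softmax(scores))             # positive entries summing to 1
print(softmax([-1.0, 1.0]))        # still a valid distribution
print(standard_normalize(scores))  # works here, but not for [-1.0, 1.0]
```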

9 Answers
  •  予麋鹿 (OP)
     2020-12-02 04:20

    I have found the explanation here to be very good: CS231n: Convolutional Neural Networks for Visual Recognition.

    On the surface, softmax looks like a simple non-linear normalization (the exponential spreads the data out), but there is more to it than that.

    Specifically, there are a couple of different views (same link as above):

    1. Information Theory - from the perspective of information theory, the softmax function can be seen as trying to minimize the cross-entropy between the predictions and the truth.

    2. Probabilistic View - from this perspective, we are in fact looking at the log-probabilities; thus when we perform exponentiation we end up with the raw probabilities. In this case the softmax equation finds the MLE (Maximum Likelihood Estimate).

    In summary, even though the softmax equation seems like it could be arbitrary, it is NOT. It is actually a principled way of normalizing the classifications that minimizes the cross-entropy/negative log-likelihood between predictions and the truth.
