Numerically stable softmax

Posted by 柔情痞子 on 2019-11-28 10:12:39

The softmax exp(x)/sum(exp(x)) is actually numerically well-behaved. It has only positive terms, so we needn't worry about loss of significance, and the denominator is at least as large as the numerator, so the result is guaranteed to fall between 0 and 1.

The only accident that might happen is over- or underflow in the exponentials. Overflow of a single element, or underflow of all elements, of x will render the output more or less useless.

But it is easy to guard against that by using the identity softmax(x) = softmax(x + c), which holds for any scalar c because the common factor exp(c) cancels between the numerator and the denominator. Subtracting max(x) from x leaves a vector with only non-positive entries, ruling out overflow, and at least one entry equal to zero, ruling out a vanishing denominator (underflow in some but not all entries is harmless).

Note: theoretically, catastrophic accidents in the sum are possible, but you'd need a ridiculous number of terms and be ridiculously unlucky. Also, numpy uses pairwise summation which is rather robust.
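
As a minimal sketch of that identity in NumPy (the small example vector and helper function below are my own illustration, not code from the question):

import numpy as np

def softmax(x):
    e = np.exp(x)
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax(x), softmax(x - np.max(x))))  # True: shifting by -max(x) leaves the probabilities unchanged

The shifted input has only non-positive entries, so the exponentials stay in (0, 1] and cannot overflow.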

The softmax function is prone to two issues: overflow and underflow.

Overflow: it occurs when very large numbers are approximated as infinity.

Underflow: it occurs when very small numbers (near zero on the number line) are rounded to zero.
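
A quick way to see both failure modes in NumPy (the values 1000 and -1000 are merely illustrative):

import numpy as np

np.exp(1000)    # overflow: the result is inf (NumPy emits a RuntimeWarning)
np.exp(-1000)   # underflow: the result is rounded to 0.0

Either one corrupts the softmax: an inf in the numerator turns into inf/inf = nan, and if every exponential underflows the denominator becomes zero.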

To combat these issues when doing softmax computation, a common trick is to shift the input vector by subtracting the maximum element in it from all elements. For the input vector x, define z such that:

z = x - max(x)

Then take the softmax of the new (stable) vector z.


Example:

In [265]: import numpy as np

In [266]: def stable_softmax(x):
     ...:     z = x - max(x)
     ...:     numerator = np.exp(z)
     ...:     denominator = np.sum(numerator)
     ...:     softmax = numerator/denominator
     ...:     return softmax
     ...: 

In [267]: vec = np.array([1, 2, 3, 4, 5])

In [268]: stable_softmax(vec)
Out[268]: array([ 0.01165623,  0.03168492,  0.08612854,  0.23412166,  0.63640865])

In [269]: vec = np.array([12345, 67890, 99999999])

In [270]: stable_softmax(vec)
Out[270]: array([ 0.,  0.,  1.])
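
For contrast, here is roughly what the unshifted formula does on the same input (naive_softmax is my own name for the direct exp(x)/sum(exp(x)) version, not part of the answer above):

def naive_softmax(x):
    numerator = np.exp(x)              # every entry overflows to inf here
    return numerator / np.sum(numerator)

naive_softmax(np.array([12345, 67890, 99999999]))
# inf / inf is nan, so the output is array([nan, nan, nan]) plus overflow warnings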

For more details, see the Numerical Computation chapter in the Deep Learning book.

Thanks to Paul Panzer for the explanation, but I was wondering why we need to subtract max(x). So I looked for more detailed information and hope it will be helpful to people who have the same question as me. See the section "What's up with that max subtraction?" in the article at the following link:

https://nolanbconaway.github.io/blog/2017/softmax-numpy

There is nothing wrong with calculating the softmax function as it is in your case. The problem seems to come from an exploding gradient or similar issues with your training method. Focus on those matters, either by clipping gradient values or by choosing the right initial distribution of weights.
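
As a rough sketch of the gradient-clipping idea (the function name and the max_norm threshold are my own choices, not from any particular library):

import numpy as np

def clip_gradient(grad, max_norm=1.0):
    # Rescale the gradient so its L2 norm never exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

clip_gradient(np.array([3.0, 4.0]))   # norm 5.0 is scaled down to max_norm, giving array([0.6, 0.8])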
