I am using a Softmax activation function in the last layer of a neural network, but I have problems with a numerically safe implementation of this function. A naive implementation of the formula can overflow for large inputs.
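To make the problem concrete, here is a sketch of what such a naive implementation might look like (the function name and the example inputs are mine, for illustration):

```python
import numpy as np

def naive_softmax(z):
    # Direct translation of oj = exp(zj) / sum_i exp(zi).
    # exp() overflows to inf for large z, so inf/inf gives nan.
    e = np.exp(z)
    return e / e.sum()

# Fine for small inputs:
naive_softmax(np.array([0.0, 0.0]))        # well-behaved: [0.5, 0.5]

# Breaks for large inputs: exp(1000) overflows, result is all nan.
naive_softmax(np.array([1000.0, 1000.0]))
```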
I know it's already answered but I'll post here a step-by-step anyway.
Put everything in log scale:
zj = wj . x + bj
oj = exp(zj)/sum_i{ exp(zi) }
log oj = zj - log sum_i{ exp(zi) }
Let m = max_i { zi } and use the log-sum-exp trick:
log oj = zj - log {sum_i { exp(zi + m - m)}}
= zj - log {sum_i { exp(m) exp(zi - m) }},
= zj - log {exp(m) sum_i {exp(zi - m)}}
= zj - m - log {sum_i { exp(zi - m)}}
The term exp(zi - m) can underflow if m is much greater than some zi, but that's fine: it means that zi is irrelevant to the softmax output after normalization. The final result is:
oj = exp (zj - m - log{sum_i{exp(zi-m)}})
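The derivation above can be sketched in NumPy like so (function and variable names are mine; the formula is exactly the last equation):

```python
import numpy as np

def stable_softmax(z):
    # oj = exp(zj - m - log(sum_i exp(zi - m))), with m = max_i zi.
    # Shifting by m guarantees every exp() argument is <= 0,
    # so nothing overflows; at worst some terms underflow to 0.
    m = z.max()
    log_sum = np.log(np.exp(z - m).sum())
    return np.exp(z - m - log_sum)

# Works even where the naive version would overflow:
stable_softmax(np.array([1000.0, 1000.0]))  # [0.5, 0.5]
```

Note that the subtraction of m cancels out mathematically, so for small inputs this returns the same values as the naive formula.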
First go to log scale, i.e. calculate log(y) instead of y. The log of the numerator is trivial. To calculate the log of the denominator, you can use the following 'trick': http://lingpipe-blog.com/2009/06/25/log-sum-of-exponentials/
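The trick from the linked post boils down to a small helper like this (a sketch; the function name is mine):

```python
import math

def log_sum_exp(zs):
    # log(sum_i exp(zi)) = m + log(sum_i exp(zi - m)), m = max_i zi.
    # Shifting by the max keeps every exp() argument <= 0, avoiding overflow.
    m = max(zs)
    return m + math.log(sum(math.exp(z - m) for z in zs))

log_sum_exp([1000.0, 1000.0])  # finite, equals 1000 + log(2)
```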