How to implement the Softmax derivative independently from any loss function?

Submitted by Anonymous (unverified) on 2019-12-03 02:52:02

Question:

For a neural network library I implemented some activation functions and loss functions along with their derivatives. They can be combined arbitrarily, and the derivative at the output layer simply becomes the element-wise product of the loss derivative and the activation derivative.

However, I failed to implement the derivative of the Softmax activation function independently from any loss function. Due to the normalization (the denominator in the equation), changing a single input activation changes all output activations, not just one.

Here is my Softmax implementation where the derivative fails the gradient checking by about 1%. How can I implement the Softmax derivative so that it can be combined with any loss function?

import numpy as np

class Softmax:

    def compute(self, incoming):
        exps = np.exp(incoming)
        return exps / exps.sum()

    def delta(self, incoming, outgoing):
        exps = np.exp(incoming)
        others = exps.sum() - exps
        return 1 / (2 + exps / others + others / exps)

activation = Softmax()
cost = SquaredError()

outgoing = activation.compute(incoming)
delta_output_layer = activation.delta(incoming) * cost.delta(outgoing)
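For context, gradient checking here means comparing the analytic derivative against a finite-difference approximation of the Jacobian. A minimal, self-contained sketch of such a check (the numerical_jacobian helper is illustrative only, not part of my library):

import numpy as np

def softmax(x):
    exps = np.exp(x - x.max())   # shifted for numerical stability
    return exps / exps.sum()

def numerical_jacobian(f, x, eps=1e-6):
    # Central-difference approximation of the Jacobian of f at x.
    out_size = f(x).size
    jac = np.zeros((out_size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[j] = eps
        jac[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return jac

x = np.array([1.0, 2.0, 3.0])
print(numerical_jacobian(softmax, x))
# A correct softmax derivative has to reproduce this full n x n matrix;
# a single per-unit vector (like the one delta returns above) cannot.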

Answer 1:

Mathematically, the derivative of Softmax $S_i$ with respect to the input $x_j$ is

$$\frac{\partial S_i}{\partial x_j} = S_i \left( \delta_{ij} - S_j \right)$$

where $\delta_{ij}$ is the Kronecker delta.
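To see where the two cases handled in the loop below come from, apply the quotient rule to $S_i = e^{x_i} / \sum_k e^{x_k}$ (a quick sketch):

$$\frac{\partial S_i}{\partial x_i} = \frac{e^{x_i} \sum_k e^{x_k} - e^{x_i} e^{x_i}}{\left( \sum_k e^{x_k} \right)^2} = S_i (1 - S_i),
\qquad
\frac{\partial S_i}{\partial x_j} = \frac{- e^{x_i} e^{x_j}}{\left( \sum_k e^{x_k} \right)^2} = -S_i S_j \quad (i \neq j).$$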

If you implement it iteratively:

def softmax_grad(s):
    # Input s is the softmax value of the original input x; its shape is (1, n),
    # e.g. s = np.array([0.3, 0.7]) for x = np.array([0, 1]).
    # Build the n x n Jacobian matrix.
    jacobian_m = np.diag(s)

    for i in range(len(jacobian_m)):
        for j in range(len(jacobian_m)):
            if i == j:
                jacobian_m[i][j] = s[i] * (1 - s[i])
            else:
                jacobian_m[i][j] = -s[i] * s[j]
    return jacobian_m

Test:

In [95]: x
Out[95]: array([1, 2])

In [96]: softmax(x)
Out[96]: array([ 0.26894142,  0.73105858])

In [97]: softmax_grad(softmax(x))
Out[97]:
array([[ 0.19661193, -0.19661193],
       [-0.19661193,  0.19661193]])

If you implement it in a vectorized version:

soft_max = softmax(x)

# Reshape the softmax output to 2-D so np.dot gives a matrix multiplication.
def softmax_grad(softmax):
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

softmax_grad(soft_max)
# array([[ 0.19661193, -0.19661193],
#        [-0.19661193,  0.19661193]])
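This also answers how to combine the derivative with an arbitrary loss: the output-layer delta becomes a Jacobian-vector product rather than an element-wise product. A minimal sketch, assuming a squared-error loss whose gradient with respect to the outputs is simply outgoing - target (the one-hot target is just for illustration):

import numpy as np

def softmax(x):
    exps = np.exp(x - x.max())
    return exps / exps.sum()

def softmax_grad(s):                    # vectorized Jacobian from above
    s = s.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

x = np.array([1.0, 2.0])
target = np.array([0.0, 1.0])           # hypothetical one-hot target

outgoing = softmax(x)
dloss = outgoing - target               # squared-error gradient w.r.t. the outputs
delta_output_layer = softmax_grad(outgoing) @ dloss   # matrix-vector product, not element-wise
print(delta_output_layer)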


Answer 2:

It should be like this (x is the input to the softmax layer, y = softmax(x) is its output, and dy is the delta coming from the loss above it):

    def delta(self, x, dy):
        y = self.compute(x)          # softmax output from the forward pass
        dx = y * dy
        s = dx.sum(axis=dx.ndim - 1, keepdims=True)
        dx -= y * s
        return dx

But the way you compute the error should be:

    yact = activation.compute(x)
    ycost = cost.compute(yact)
    dsoftmax = activation.delta(x, cost.delta(yact, ycost, ytrue))

Explanation: Because the delta function is part of the backpropagation algorithm, its responsibility is to multiply the vector dy (dy in my code, outgoing in yours) by the Jacobian of the compute(x) function evaluated at x. If you work out what this Jacobian looks like for softmax [1] and then multiply it from the left by a vector dy, after a bit of algebra you'll find that you get something that corresponds to my Python code.
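That bit of algebra can also be checked numerically: the shortcut above must agree with an explicit Jacobian-vector product. A small sketch using the softmax_grad from Answer 1:

import numpy as np

def softmax(x):
    exps = np.exp(x - x.max())
    return exps / exps.sum()

def softmax_grad(s):                   # full Jacobian, as in Answer 1
    s = s.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)

x = np.array([1.0, 2.0, 3.0])
dy = np.array([0.1, -0.3, 0.2])        # arbitrary upstream delta

y = softmax(x)
full = softmax_grad(y) @ dy            # explicit Jacobian-vector product
short = y * dy - y * (y * dy).sum()    # the shortcut from the code above
print(np.allclose(full, short))        # True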

[1] https://stats.stackexchange.com/questions/79454/softmax-layer-in-a-neural-network


