Why is softmax not used in hidden layers [duplicate]


Question


I have read the answer given here. My exact question pertains to the accepted answer:

  1. Variable independence: a lot of regularization and effort goes into keeping your variables independent, uncorrelated and quite sparse. If you use a softmax layer as a hidden layer, then you keep all your nodes (hidden variables) linearly dependent, which may result in many problems and poor generalization.

What complications arise from forgoing variable independence in the hidden layers? Please provide at least one example. I know that independence of the hidden variables helps a lot in writing out backpropagation, but backpropagation can be written for softmax as well (please verify whether I am correct in this claim; my equations seem to work out, hence the claim).
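For reference, the backward pass for a softmax layer is indeed straightforward to write down. Below is a minimal NumPy sketch (my own illustration, not from the original post) that uses the Jacobian diag(s) - s s^T of the softmax:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the output is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_backward(s, grad_out):
    # s is the softmax output, grad_out is dL/ds from the layer above.
    # The Jacobian ds/dz is diag(s) - s s^T, so dL/dz = J @ dL/ds.
    jacobian = np.diag(s) - np.outer(s, s)
    return jacobian @ grad_out

z = np.array([1.0, 2.0, 0.5])
s = softmax(z)
print(s, s.sum())                                       # activations sum to 1
print(softmax_backward(s, np.array([0.1, -0.2, 0.3])))  # gradient w.r.t. z
```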

  2. Training issue: try to imagine that, to make your network work better, you have to make some of the activations of your hidden layer a little bit lower. Then, automatically, you force the rest of them to have a higher mean activation, which might in fact increase the error and harm your training phase.

I don't understand how you achieve that kind of flexibility even with sigmoid hidden neurons, where you can fine-tune the activation of a particular neuron, which is precisely gradient descent's job. So why are we even worried about this issue? If you can implement backprop, the rest will be taken care of by gradient descent. Fine-tuning the weights so as to make the activations proper is not something you would want to do by hand, even if you could, which you can't. (Kindly correct me if my understanding is wrong here.)

  3. Mathematical issue: by creating constraints on the activations of your model you decrease its expressive power without any logical explanation. Striving to have all activations the same is not worth it, in my opinion.

Kindly explain what is being said here.

  4. Batch normalization: I understand this; no issues here.

Answer 1:


1/2. I don't think you have a clue about what the author is trying to say. Imagine a layer with 3 nodes. Two of these nodes have an error responsibility of 0 with respect to the output error, so there is one node that should be adjusted. But if you want to improve the output of node 0, then you immediately affect nodes 1 and 2 in that layer, possibly making the output even more wrong.
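To make the coupling concrete, here is a small NumPy illustration of the 3-node example above (my own sketch, not code from the original answer): raising the logit of node 0 necessarily lowers the softmax outputs of nodes 1 and 2, even though their own logits are untouched.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.5, 1.0, -0.3])
print(softmax(z))                              # baseline activations of the 3 nodes
print(softmax(z + np.array([1.0, 0.0, 0.0])))  # raise node 0's logit only:
                                               # nodes 1 and 2 drop as well
```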

Fine-tuning the weights so as to make the activations proper is not something you would want to do by hand, even if you could, which you can't. (Kindly correct me if my understanding is wrong here.)

That is the definition of backpropagation. That is exactly what you want. Neural networks rely on activations (which are non-linear) to map a function.

3. You're basically saying to every neuron, "Hey, your output cannot be higher than x, because some other neuron in this layer already has value y." Because all neurons in a softmax layer must have a total activation of 1, no single neuron can exceed a certain value. For small layers this is a small problem, but for big layers it is a big problem. Imagine a layer with 100 neurons. Their total output must be 1, so the average value of those neurons will be 0.01. That means the network's connections have to rely on activations that stay very low on average, whereas other activation functions output (or take as input) values in the range (0, 1) or (-1, 1).
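Here is a rough NumPy sketch of the scale problem described above (my own illustration, with arbitrary random logits): with 100 units, softmax activations must average 1/100 = 0.01, whereas a sigmoid layer of the same width can use the full (0, 1) range per unit.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100)            # 100 arbitrary pre-activations

soft = np.exp(z - z.max())
soft /= soft.sum()                  # softmax over the 100 units
sig = 1.0 / (1.0 + np.exp(-z))      # element-wise sigmoid on the same logits

print(soft.mean())   # exactly 0.01, because the 100 values must sum to 1
print(sig.mean())    # around 0.5; each unit is free to use the whole (0, 1) range
```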



Source: https://stackoverflow.com/questions/44223902/why-is-softmax-not-used-in-hidden-layers
