Most examples of neural networks for classification tasks I\'ve seen use the a softmax layer as output activation function. Normally, the other hidden units use a sigmoid, t
Use a softmax activation wherever you want to model a multinomial distribution. This may be (usually) an output layer y, but can also be an intermediate layer, say a multinomial latent variable z. As mentioned in this thread for outputs {o_i}, sum({o_i}) = 1 is a linear dependency, which is intentional at this layer. Additional layers may provide desired sparsity and/or feature independence downstream.
Page 198 of Deep Learning (Goodfellow, Bengio, Courville)
Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable. Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.