Neural networks are used for pattern recognition, and pattern recognition is an inherently non-linear task.
Suppose, for the sake of argument, that we use a linear activation function y = wX + b for every single neuron and classify with a simple rule: if y > 0, predict class 1, otherwise class 0.
Now we can compute our loss using squared error and backpropagate it so that the model learns well, correct?
WRONG.
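Here is why. To make the setup concrete, below is a minimal NumPy sketch of the all-linear network just described; the layer sizes, weight values, input, and target are illustrative assumptions, not a specific implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 2 inputs -> 4 -> 4 -> 1 output, every layer purely linear.
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    # Each "activation" is just the identity: y = wX + b, no ReLU/sigmoid/tanh anywhere.
    h1 = W1 @ x + b1
    h2 = W2 @ h1 + b2
    return W3 @ h2 + b3

x, target = np.array([0.5, -1.0]), 1.0
y = forward(x)                       # shape (1,)
pred_class = 1 if y[0] > 0 else 0    # the threshold rule from the text
loss = 0.5 * (y[0] - target) ** 2    # squared error loss
print(pred_class, loss)
```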
For the last hidden layer, the update is roughly w{l} = w{l} - (alpha)*(error)*X, where (error) is the error term propagated from the output.
For the second-to-last hidden layer, the update becomes w{l-1} = w{l-1} - (alpha)*(error)*w{l}*X.
For the i-th hidden layer counting back from the output, the update is w{i} = w{i} - (alpha)*(error)*w{l}*w{l-1}*...*w{i+1}*X.
In other words, the gradient reaching an early layer is scaled by the product of all the downstream weight matrices, which leaves three possibilities (a numeric sketch follows the list):
A) w{i} barely changes, due to a vanishing gradient
B) w{i} changes dramatically and inaccurately, due to an exploding gradient
C) w{i} changes by just the right amount and gives a good fit
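Here is a quick numeric sketch of what that product of weights does to the gradient signal, using a chain of hypothetical 1-unit linear layers; the weight values (0.5, 1.5, 1.0) and the depth of 20 are illustrative assumptions.

```python
import numpy as np

def gradient_scale(w_value, n_layers):
    # The error signal reaching layer i is multiplied by every weight between
    # layer i and the output: w{l} * w{l-1} * ... * w{i+1}.
    return np.prod(np.full(n_layers, w_value))

print(gradient_scale(0.5, 20))   # ~9.5e-07  -> case A, vanishing gradient
print(gradient_scale(1.5, 20))   # ~3.3e+03  -> case B, exploding gradient
print(gradient_scale(1.0, 20))   # 1.0       -> case C, the lucky regime
```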
If case C happens, it means our classification/prediction problem was most probably simple enough for a plain linear or logistic regression and never required a neural network in the first place, since a composition of linear layers is itself just one linear layer.
No matter how robust or well hyper-tuned your NN is, if you use a linear activation function you will never be able to tackle pattern recognition problems that require non-linearity, as the sketch below demonstrates.
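Below is a minimal sketch showing that a stack of linear layers collapses into a single linear map; the weights are illustrative random values and biases are omitted for brevity (with biases the composition is still just one affine map).

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 2))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(1, 4))

x = rng.normal(size=2)
deep = W3 @ (W2 @ (W1 @ x))          # three "layers", all with linear activations
collapsed = (W3 @ W2 @ W1) @ x       # one equivalent single linear layer
print(np.allclose(deep, collapsed))  # True: the extra depth added nothing
```

Since a single linear map cannot separate a non-linear pattern such as XOR, adding more linear layers buys nothing; only a non-linear activation (ReLU, sigmoid, tanh, etc.) gives the network real expressive power.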