Several good answers are here already. It is also worth pointing to the book "Pattern Recognition and Machine Learning" by Christopher M. Bishop, which is worth consulting for deeper insight into many ML concepts. Excerpt from page 229 (Section 5.1):
> If the activation functions of all the hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units. This follows from the fact that the composition of successive linear transformations is itself a linear transformation. However, if the number of hidden units is smaller than either the number of input or output units, then the transformations that the network can generate are not the most general possible linear transformations from inputs to outputs because information is lost in the dimensionality reduction at the hidden units. In Section 12.4.2, we show that networks of linear units give rise to principal component analysis. In general, however, there is little interest in multilayer networks of linear units.
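To see the collapse concretely, here is a minimal numpy sketch (the layer sizes and random weights are arbitrary, chosen only for illustration): a two-layer network with identity activations computes exactly the same function as a single linear layer whose weights are the product of the two weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A two-layer "network" with purely linear (identity) activations:
#   hidden = x @ W1 + b1,   output = hidden @ W2 + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)  # 4 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)  # 3 hidden -> 2 outputs

x = rng.normal(size=(5, 4))  # a batch of 5 input vectors
two_layer = (x @ W1 + b1) @ W2 + b2

# Collapse to one layer: W = W1 W2 and b = b1 W2 + b2 give the same map.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layer, one_layer))  # True: the hidden layer adds nothing
```

Note that this example also illustrates the dimensionality-reduction caveat in the quote: since the hidden layer has 3 units, the collapsed matrix `W = W1 @ W2` has rank at most 3, so the two-layer network cannot realize every linear map from 4 inputs to 2 outputs that an unconstrained single layer could (here rank 2 suffices, but with more outputs it would matter).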