If you do logistic regression, for example, you use the sigmoid function to estimate the probability, cross-entropy as the loss function, and gradient descent to minimize it. Doing the same thing with MSE as the loss function gives a non-convex objective (squared error composed with the sigmoid), so gradient descent can get stuck in local minima. With cross-entropy the objective is convex, so any minimum gradient descent finds is the global optimum.
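To see the difference concretely, here is a minimal NumPy sketch (the toy data, the weight grid, and the tolerance are illustrative assumptions, nothing canonical): it evaluates both losses for a one-parameter model `p = sigmoid(w * x)` along a grid of weights and checks convexity through second differences, which must be non-negative for a convex function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data for a bias-free model p = sigmoid(w * x); labels chosen so
# the classes are not perfectly separable (illustrative assumption).
x = np.array([-4.0, -2.0, 1.0, 3.0])
y = np.array([1.0, 0.0, 0.0, 1.0])

# Evaluate both losses along a 1-D slice of weight values.
ws = np.linspace(-10.0, 10.0, 2001)
z = np.outer(ws, x)                      # logits for every candidate w

# Cross-entropy in the numerically stable form log(1 + e^z) - y*z,
# so the plateau regions are not distorted by clipping.
ce = np.mean(np.logaddexp(0.0, z) - y * z, axis=1)
# Mean squared error on the sigmoid outputs.
mse = np.mean((sigmoid(z) - y) ** 2, axis=1)

def is_convex(losses, tol=1e-9):
    """A convex function sampled on a uniform grid has non-negative
    second differences (up to floating-point noise)."""
    return np.all(np.diff(losses, 2) >= -tol)

print("cross-entropy convex along this slice:", is_convex(ce))   # True
print("MSE convex along this slice:          ", is_convex(mse))  # False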
https://www.youtube.com/watch?v=rtD0RvfBJqQ&list=PL0Smm0jPm9WcCsYvbhPCdizqNKps69W4Z&index=35
There is also an interesting analysis here:
https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/