After profiling my backpropagation algorithm, I have learnt that it is responsible for 60% of my computation time.
Before I start looking at parallel alternatives
I'm not sure whether the compiler can already optimize this in your case, but hoisting inverse_momentum * (learn_rate * errors[i][j]) into a variable outside the "k" loop in the lower loops might decrease the load on the CPU.
BTW, you are profiling a release binary and not a debug one, aren't you?
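To make the hoisting suggestion concrete, here is a minimal sketch in C++. Your actual update code is not shown, so everything except inverse_momentum, learn_rate and errors[i][j] is an assumption on my part: the weights layout, the inputs array, the exact update formula and the function names are all hypothetical and only there to show the shape of the change.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix  = std::vector<std::vector<double>>;
using Tensor3 = std::vector<Matrix>;

// Hypothetical "before" version: the product that depends only on i and j
// is recomputed on every iteration of the k loop.
void update_naive(Tensor3& weights, const Matrix& errors, const Matrix& inputs,
                  double learn_rate, double inverse_momentum) {
    assert(weights.size() == errors.size());
    for (std::size_t i = 0; i < weights.size(); ++i)
        for (std::size_t j = 0; j < weights[i].size(); ++j)
            for (std::size_t k = 0; k < weights[i][j].size(); ++k)
                weights[i][j][k] +=
                    inverse_momentum * (learn_rate * errors[i][j]) * inputs[i][k];
}

// Hypothetical "after" version: the loop-invariant product is computed once
// per (i, j) pair and reused inside the k loop.
void update_hoisted(Tensor3& weights, const Matrix& errors, const Matrix& inputs,
                    double learn_rate, double inverse_momentum) {
    assert(weights.size() == errors.size());
    for (std::size_t i = 0; i < weights.size(); ++i) {
        for (std::size_t j = 0; j < weights[i].size(); ++j) {
            // Does not depend on k, so it can live outside the k loop.
            const double scale = inverse_momentum * (learn_rate * errors[i][j]);
            std::vector<double>& row = weights[i][j];
            for (std::size_t k = 0; k < row.size(); ++k)
                row[k] += scale * inputs[i][k];
        }
    }
}
```

Both versions should produce the same weights. Whether the second one is measurably faster depends on your compiler and flags, since an optimizing build may already do this for you, so profile before and after.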