Below are two C programs that are almost identical, except that I switched the i and j variables around. They run in different amounts of time.
I'll try to give a generic answer.
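Because, for a two-dimensional array in C, i[y][x] refers to the element at offset y*array_width + x from the start of the array, i.e. *(base + y*array_width + x), where base points to the first element. Indexing is just pointer arithmetic: a[b] is shorthand for *(a + b) (try out the classy int P[3]; 0[P] = 0xBEEF;).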
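To make the address arithmetic concrete, here is a minimal sketch that checks both claims. The dimensions ARRAY_HEIGHT and ARRAY_WIDTH and the function names are made up for illustration; they are not from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dimensions, chosen only for this illustration. */
enum { ARRAY_HEIGHT = 3, ARRAY_WIDTH = 4 };

/* Returns 1 if every grid[y][x] sits exactly
   (y * ARRAY_WIDTH + x) elements past the start of the array. */
int row_major_holds(void)
{
    int grid[ARRAY_HEIGHT][ARRAY_WIDTH];

    for (size_t y = 0; y < ARRAY_HEIGHT; y++) {
        for (size_t x = 0; x < ARRAY_WIDTH; x++) {
            /* Measure the distance in bytes from the start of the array. */
            size_t actual = (size_t)((char *)&grid[y][x] - (char *)grid);
            size_t expected = (y * ARRAY_WIDTH + x) * sizeof(int);
            if (actual != expected)
                return 0;
        }
    }
    return 1;
}

/* a[b] means *(a + b), and addition commutes, so 0[P] is P[0]. */
int commuted_index(void)
{
    int P[3] = {0};
    0[P] = 0xBEEF;
    assert(P[0] == 0xBEEF);
    return P[0];
}
```

Both functions only inspect addresses or a single element, so they say nothing about speed yet; they just show that moving one step in y moves a whole row forward in memory.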
Each step in y therefore jumps ahead by a whole chunk of array_width * sizeof(array_element) bytes. If y is the variable of your inner loop, you make about array_width * array_height of those chunk-sized jumps.
By flipping the order, you make only about array_height chunk-sized jumps, and between any two of them you make array_width steps of only sizeof(array_element) bytes each.
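The difference in jump counts can be checked with a small counting sketch. It does not touch any real array; it only walks the index pairs in each loop order and counts how often two consecutive accesses fall into different rows. The function name row_jumps and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how often two consecutive accesses land in different rows,
   i.e. how many chunk-sized jumps a traversal makes.
   y_is_inner != 0 : y varies in the inner loop (the slow order).
   y_is_inner == 0 : x varies in the inner loop (the fast order). */
size_t row_jumps(size_t array_height, size_t array_width, int y_is_inner)
{
    assert(array_height > 0 && array_width > 0);

    size_t outer_n = y_is_inner ? array_width : array_height;
    size_t inner_n = y_is_inner ? array_height : array_width;
    size_t jumps = 0;
    size_t prev_row = 0; /* the first access is always in row 0 */

    for (size_t outer = 0; outer < outer_n; outer++) {
        for (size_t inner = 0; inner < inner_n; inner++) {
            size_t y = y_is_inner ? inner : outer;
            if (y != prev_row)
                jumps++;
            prev_row = y;
        }
    }
    return jumps;
}
```

For a 4 x 8 array (height 4, width 8), row_jumps(4, 8, 1) gives 31, one less than array_width * array_height, while row_jumps(4, 8, 0) gives 3, one less than array_height.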
While this did not matter much on really old x86 CPUs, modern x86 CPUs do a lot of prefetching and caching of data. In your slower iteration order, consecutive accesses are array_width * sizeof(array_element) bytes apart, so you probably produce many cache misses.
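If you want to measure the effect yourself, something along these lines should do. It is only a sketch: the array size, the names sum_x_inner, sum_y_inner and seconds_for, and the use of clock() are my assumptions, and the actual timings depend heavily on your CPU, compiler, and optimization flags.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical size: 1024 x 1024 ints, larger than many caches. */
enum { ROWS = 1024, COLS = 1024 };
static int cells[ROWS][COLS];

/* Fill the array so that both sums have a known result. */
void fill_grid(void)
{
    for (size_t y = 0; y < ROWS; y++)
        for (size_t x = 0; x < COLS; x++)
            cells[y][x] = 1;
}

/* Fast order: x in the inner loop, so memory is walked sequentially. */
long sum_x_inner(void)
{
    long sum = 0;
    for (size_t y = 0; y < ROWS; y++)
        for (size_t x = 0; x < COLS; x++)
            sum += cells[y][x];
    return sum;
}

/* Slow order: y in the inner loop, so each access jumps a whole row. */
long sum_y_inner(void)
{
    long sum = 0;
    for (size_t x = 0; x < COLS; x++)
        for (size_t y = 0; y < ROWS; y++)
            sum += cells[y][x];
    return sum;
}

/* Runs fn once, stores its result, and returns the CPU time in seconds. */
double seconds_for(long (*fn)(void), long *result)
{
    assert(fn != NULL && result != NULL);
    clock_t start = clock();
    *result = fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call fill_grid() once, then pass sum_x_inner and sum_y_inner to seconds_for and compare the two times. Both functions return the same sum; only the order of the memory accesses differs. An optimizing compiler may interchange or vectorize the loops, so the gap can shrink or disappear at higher optimization levels.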