A matrix, in memory, is align linearly, such that the items in a row are next to each other in memory (spacial locality). When you transverse items in order such that you go through all of the columns in a row before moving onto the next one, when the CPU comes across an entry that isn't loaded into its cache yet, it will go and load that value along with a whole block of other values close to it in physical memory so the next several values will already be cached by the time it needs to read them.
When you transverse them the other way, the other values it loads in that are near it in memory are not going to be the next ones read, so you wind up with a lot more cache misses and so the CPU has to sit and wait while the data is brought in from the next layer of the memory hierarchy.
By the time you swing back around to another entry that you had previously cached, it more than likely has been booted out of the cache in favor of all the other data you've since loaded in as it will not have been recently used anymore (temporal locality)