Given that the time drops again at larger sizes, wouldn't cache conflicts be a more likely cause, especially since the problematic matrix sizes are powers of 2? I am no expert on caching issues, but there is excellent info on cache-related performance issues here.
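
To illustrate why power-of-2 sizes look suspicious, here is a rough sketch. It assumes a hypothetical set-associative cache with 64-byte lines and 64 sets (for example a 32 KiB, 8-way L1), and a square row-major matrix of 8-byte doubles. None of these numbers come from the question, and `distinct_sets` is just a name made up for this example. It counts how many different cache sets are touched when walking down one column.

```python
def distinct_sets(n, elem_size=8, line_size=64, num_sets=64):
    """Count the cache sets hit when reading column 0 of an n x n row-major matrix.

    All parameters are assumptions for illustration, not measured values:
    elem_size is bytes per element, line_size is bytes per cache line,
    num_sets is the number of sets in the (hypothetical) cache.
    """
    sets = set()
    for row in range(n):
        # Byte offset of element (row, 0) from the start of the matrix.
        offset = row * n * elem_size
        # Which cache line the offset falls in, then which set that line maps to.
        line = offset // line_size
        sets.add(line % num_sets)
    return len(sets)


if __name__ == "__main__":
    for n in (512, 513, 1024):
        print(n, distinct_sets(n))
```

Under these assumptions, a 512 x 512 matrix puts every element of a column into the same set (1 distinct set), so with 8 ways only a handful of lines can stay cached at once, while 513 x 513 spreads the column over all 64 sets. Real hardware has several cache levels and may index them differently, so treat this as an intuition aid rather than a model of any particular CPU.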