In 32-bit code, mov ax, bx needs an operand-size prefix (the 66h byte), whereas byte-sized moves don't, because they have their own opcodes. Apparently modern processor designers do not put much effort into making the operand-size prefix decode quickly, though it surprises me that the penalty would be large enough to justify doing two byte-sized moves instead.
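
To make the encoding difference concrete, here is a small listing of the moves in question. It is only a sketch: it assumes NASM syntax and the register-to-register forms NASM normally picks (other assemblers may emit the equivalent 8B/8A encodings), and the mov al, bl / mov ah, bh pair is my guess at what "two byte-sized moves" would look like.

```nasm
bits 32                 ; 32-bit mode: the default operand size is 32 bits

mov eax, ebx            ; 89 D8       2 bytes, no prefix needed
mov ax, bx              ; 66 89 D8    3 bytes, 66h operand-size prefix selects 16 bits

; Assumed replacement: copy the same 16 bits as two 8-bit halves.
mov al, bl              ; 88 D8       2 bytes, byte moves use a separate opcode (88h)
mov ah, bh              ; 88 FC       2 bytes
```

Note that the byte-move pair is 4 bytes against 3 for the prefixed move, so it is not a code-size win; any benefit would have to come from avoiding the prefix at decode time.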