Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?

前端 未结 1 1396
忘掉有多难
忘掉有多难 2020-12-01 21:37

AMD CPUs handle 256b AVX instructions by decoding into two 128b operations. e.g. vaddps ymm0, ymm1,ymm1 on AMD Steamroller decodes to 2 macro-ops, with half th

相关标签:
1条回答
  • 2020-12-01 22:24

    xor'ing a ymm register with itself generates two micro-ops on AMD Ryzen, while xor'ing an xmm register with itself generates only one micro-op. So the optimal way of xeroing a ymm register is to xor the corresponding xmm register with itself and rely on implicit zero extension.

    The only processor that supports AVX512 today is Knights Landing. It uses a single micro-op for xor'ing a zmm register. It is very common to handle a new extension of vector size by splitting it in two. This happened with the transition from 64 to 128 bits and with the transition from 128 to 256 bits. It is more than likely that some processors in the future (from AMD or Intel or any other vendor) will split 512-bit vectors into two 256-bit vectors or even four 128-bit vectors. So the optimal way to zero a zmm register is to xor the 128-bit register with itself and rely on zero extension. And you are right, the 128-bit VEX-coded instruction is one or two bytes shorter.

    Most processors recognize the xor of a register with itself to be independent of the previous value of the register.

    0 讨论(0)
提交回复
热议问题