Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?

Submitted by 岁酱吖の on 2019-11-27 15:06:17

Xor'ing a ymm register with itself generates two micro-ops on AMD Ryzen, while xor'ing an xmm register with itself generates only one. So the optimal way of zeroing a ymm register is to xor the corresponding xmm register with itself, using the VEX-coded instruction, and rely on the implicit zero extension into the upper half.
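To make the zero-extension rule concrete, here is a toy model in Python. It runs no SIMD code; it only treats a 256-bit register as an integer and applies the architectural rule that a VEX-coded 128-bit instruction (for example `vxorps xmm0, xmm0, xmm0`) clears the upper bits of the full register, while a legacy-SSE instruction (for example `xorps xmm0, xmm0`) leaves them alone. The function names are made up for this sketch.

```python
# Toy model: a 256-bit ymm register held as a Python int.
XMM_MASK = (1 << 128) - 1   # bits 0..127
YMM_MASK = (1 << 256) - 1   # bits 0..255


def vex_xor_xmm(dst_full, a, b):
    """Model of a VEX-coded 128-bit xor (hypothetical helper).

    The result lands in bits 0..127 and bits 128..255 of the
    destination are zeroed, whatever dst_full held before.
    """
    return (a ^ b) & XMM_MASK


def legacy_sse_xor_xmm(dst_full, a, b):
    """Model of a legacy-SSE 128-bit xor (hypothetical helper).

    The result lands in bits 0..127 and bits 128..255 of the
    destination are preserved.
    """
    upper = dst_full & YMM_MASK & ~XMM_MASK
    return upper | ((a ^ b) & XMM_MASK)


ymm0 = YMM_MASK  # start with all 256 bits set

after_vex = vex_xor_xmm(ymm0, ymm0, ymm0)
after_sse = legacy_sse_xor_xmm(ymm0, ymm0, ymm0)

print("VEX xmm xor, whole ymm zero:", after_vex == 0)
print("legacy xmm xor, upper half still set:", (after_sse >> 128) == XMM_MASK)
```

In this model only the VEX form leaves the whole register at zero, which is why the xmm trick has to use the VEX-coded instruction.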

The only processor that supports AVX512 today is Knights Landing. It uses a single micro-op for xor'ing a zmm register. It is very common to handle a new, wider vector size by splitting each operation in two. This happened with the transition from 64 to 128 bits and again with the transition from 128 to 256 bits. It is more than likely that some future processors (from AMD, Intel, or any other vendor) will split 512-bit vectors into two 256-bit halves or even four 128-bit quarters. So the optimal way to zero a zmm register is to xor the corresponding 128-bit register with itself and rely on zero extension. And you are right, the 128-bit VEX-coded instruction is one or two bytes shorter.
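The size difference can be checked by building the encodings by hand. The sketch below is a minimal encoder, not a general assembler: it only handles a register xor'ed with itself, registers 0 to 15, and no masking. The helper names are invented for this example, and the choice of `vpxord` as the zmm instruction is an assumption (the answer does not name a mnemonic); it is picked because it needs only AVX512F.

```python
def modrm_reg_reg(reg, rm):
    """ModRM byte for a register-to-register form (mod = 11)."""
    return 0xC0 | ((reg & 7) << 3) | (rm & 7)


def vex_vxorps_self(n, ymm=False):
    """Bytes of `vxorps r, r, r` for xmm/ymm register n (0..15), VEX-coded."""
    assert 0 <= n <= 15
    hi = n >> 3                  # extension bit, needed as both R and B
    inv_vvvv = (~n) & 0xF        # second source, stored inverted
    vec_len = 1 if ymm else 0    # VEX.L: 0 = 128-bit, 1 = 256-bit
    pp = 0b00                    # no implied prefix for vxorps
    if hi == 0:
        # The 2-byte VEX prefix is enough when B is not needed.
        prefix = [0xC5, (1 << 7) | (inv_vvvv << 3) | (vec_len << 2) | pp]
    else:
        # The 3-byte VEX prefix carries inverted R, X, B and the 0F map.
        byte1 = (0 << 7) | (1 << 6) | (0 << 5) | 0x01
        byte2 = (inv_vvvv << 3) | (vec_len << 2) | pp
        prefix = [0xC4, byte1, byte2]
    return bytes(prefix + [0x57, modrm_reg_reg(n, n)])


def evex_vpxord_zmm_self(n):
    """Bytes of `vpxord zmm, zmm, zmm` for register n (0..15), EVEX-coded."""
    assert 0 <= n <= 15
    inv_hi = (n >> 3) ^ 1
    # P0: inverted R, X, B, R' then the 0F opcode map.
    p0 = (inv_hi << 7) | (1 << 6) | (inv_hi << 5) | (1 << 4) | 0x01
    # P1: W = 0, inverted vvvv, fixed 1 bit, pp = 01 (66 prefix).
    p1 = (((~n) & 0xF) << 3) | (1 << 2) | 0b01
    # P2: 512-bit vector length, inverted V' = 1, no mask register.
    p2 = (0b10 << 5) | (1 << 3)
    return bytes([0x62, p0, p1, p2, 0xEF, modrm_reg_reg(n, n)])


for n in (0, 8):
    xmm = vex_vxorps_self(n)
    ymm = vex_vxorps_self(n, ymm=True)
    zmm = evex_vpxord_zmm_self(n)
    print(f"reg {n}: xmm {xmm.hex()} ({len(xmm)} bytes), "
          f"ymm {ymm.hex()} ({len(ymm)} bytes), "
          f"zmm {zmm.hex()} ({len(zmm)} bytes)")
```

With these encodings the VEX xmm form takes 4 bytes for registers 0 to 7 and 5 bytes for registers 8 to 15, against 6 bytes for the EVEX zmm form, which is where the "one or two bytes" comes from. The xmm and ymm forms are the same length, so between those two the gain is in micro-ops rather than code size.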

Most processors recognize the xor of a register with itself as independent of the previous value of the register, so the zeroing does not have to wait for earlier writes to that register.
