SIMD instructions lowering CPU frequency

后端未结

关注

 2  1457

情歌与酒 2020-11-29 22:52

I read this article. It talked about why AVX-512 instruction:

Intel’s latest processors have advanced instructions (AVX-512) that may cause the core,

2条回答

夕颜 (楼主)

2020-11-29 23:19

It's not the instruction mnemonic that matters, it's 512-bit vector width at all that matters.

You can use the 256-bit version of AVX-512VL instructions, e.g. vpternlogd ymm0, ymm1, ymm2 without incurring the AVX-512 turbo penalty.

Related: Dynamically determining where a rogue AVX-512 instruction is executing is about a case where one AVX-512 instruction in glibc init code or something left a dirty upper ZMM that gimped max turbo for the rest of the process lifetime. (Or until a vzeroupper maybe)

Although there can be other turbo impacts from light / heavy use of 256-bit FP math instructions, and some of that is due to heat. But usually 256-bit is worth it on modern CPUs.

Anyway, this is why gcc -march=skylake-avx512 defaults to -mprefer-vector-width=256. For any given workload, it's worth trying -mprefer-vector-width=512 and maybe also 128, depending on how much or how little of the work can usefully auto-vectorize.

Tell GCC to tune for your CPU (e.g. -march=native) and it will hopefully make good choices. Although on a desktop Skylake-X, the turbo penalty is smaller than a Xeon. And if your code does actually benefit from 512-bit vectorization, it can be worth it to pay the penalty.

(Also beware the other major effect of Skylake-family CPUs going into 512-bit vector mode: the vector ALUs on port 1 shut down, so only scalar instructions like popcnt or add can use port 1. So vpand and vpaddb etc. throughput drops from 3 to 2 per clock. And if you're on an SKX with two 512-bit FMA units, the extra one on port 5 powers up, so then FMAs compete with shuffles.)

0 讨论(0)

查看其它2个回答
发布评论:

提交评论
- 加载中...