SIMD instructions lowering CPU frequency

情歌与酒 2020-11-29 22:52

I read this article. It talked about why AVX-512 instruction:

Intel’s latest processors have advanced instructions (AVX-512) that may cause the core,

2 answers
  •  夕颜 (original poster)
    2020-11-29 23:19

    It's not the instruction mnemonic that matters; what matters is whether 512-bit vector width is used at all.

    You can use the 256-bit version of AVX-512VL instructions, e.g. vpternlogd ymm0, ymm1, ymm2 without incurring the AVX-512 turbo penalty.
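
A minimal sketch of that point in C, assuming GCC or Clang on x86-64. The function names (xor3, xor3_ymm, xor3_scalar) are made up for illustration. The vector path uses the intrinsic _mm256_ternarylogic_epi32, which maps to vpternlogd on YMM registers, so it needs AVX-512VL but never touches a 512-bit register. The runtime check keeps it runnable on CPUs without AVX-512.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar fallback: three-way XOR of 32-bit elements. */
static void xor3_scalar(const uint32_t *a, const uint32_t *b,
                        const uint32_t *c, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] ^ b[i] ^ c[i];
}

/* 256-bit AVX-512VL path: vpternlogd on YMM registers only. */
__attribute__((target("avx512f,avx512vl")))
static void xor3_ymm(const uint32_t *a, const uint32_t *b,
                     const uint32_t *c, uint32_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i vc = _mm256_loadu_si256((const __m256i *)(c + i));
        /* imm8 = 0x96 is the truth table for a ^ b ^ c. */
        __m256i r = _mm256_ternarylogic_epi32(va, vb, vc, 0x96);
        _mm256_storeu_si256((__m256i *)(out + i), r);
    }
    for (; i < n; i++)              /* leftover elements */
        out[i] = a[i] ^ b[i] ^ c[i];
}

/* Hypothetical dispatcher: pick the YMM path only when the CPU has AVX-512VL. */
void xor3(const uint32_t *a, const uint32_t *b,
          const uint32_t *c, uint32_t *out, size_t n)
{
    if (__builtin_cpu_supports("avx512vl"))
        xor3_ymm(a, b, c, out, n);
    else
        xor3_scalar(a, b, c, out, n);
}
```

To confirm that no ZMM register sneaks in, it is worth checking the disassembly of the compiled object for zmm operands rather than trusting the source alone.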

    Related: the question "Dynamically determining where a rogue AVX-512 instruction is executing" is about a case where one AVX-512 instruction in glibc init code or something similar left an upper ZMM register dirty, which gimped max turbo for the rest of the process lifetime (or maybe until a vzeroupper).

    There can also be other turbo impacts from light or heavy use of 256-bit FP math instructions, some of which is due to heat, but 256-bit is usually worth it on modern CPUs.

    Anyway, this is why gcc -march=skylake-avx512 defaults to -mprefer-vector-width=256. For any given workload, it's worth trying -mprefer-vector-width=512 and maybe also 128, depending on how much or how little of the work can usefully auto-vectorize.
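
As a rough way to try this, here is a small auto-vectorizable loop in plain C. The function name saxpy and the file name saxpy.c are placeholders; the flags are the ones mentioned above. Which width wins depends on the workload and the CPU, so treat it as an experiment to benchmark, not a recommendation.

```c
#include <stddef.h>

/*
 * y[i] += a * x[i]: a simple loop that GCC can auto-vectorize at -O3.
 * The restrict qualifiers promise the compiler that x and y do not overlap.
 *
 * Build the same file three ways and compare the generated code and timings:
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=128 -c saxpy.c
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256 -c saxpy.c
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -c saxpy.c
 * Disassembling the object (objdump -d saxpy.o) shows whether the loop
 * uses xmm, ymm, or zmm registers.
 */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

The source is identical in all three builds; only the preferred vector width changes, which makes the comparison fair.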

    Tell GCC to tune for your CPU (e.g. -march=native) and it will hopefully make good choices. On a desktop Skylake-X, though, the turbo penalty is smaller than on a Xeon, and if your code actually benefits from 512-bit vectorization, it can be worth paying the penalty.

    (Also beware the other major effect of Skylake-family CPUs going into 512-bit vector mode: the vector ALUs on port 1 shut down, so only scalar instructions like popcnt or add can use port 1. So vpand and vpaddb etc. throughput drops from 3 to 2 per clock. And if you're on an SKX with two 512-bit FMA units, the extra one on port 5 powers up, so then FMAs compete with shuffles.)
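
The "3 to 2 per clock" figure above can be turned into a back-of-the-envelope comparison. The sketch below only models peak ALU throughput for simple integer ops like vpand: 3 ops of 32 bytes per clock in 256-bit mode versus 2 ops of 64 bytes in 512-bit mode. It ignores memory bandwidth, the front end, and everything else, and the function names are invented for illustration.

```c
/* Peak bytes processed per nanosecond: ops/clock * bytes/op * clocks/ns (GHz). */
double peak_bytes_per_ns(int ops_per_clock, int vector_bytes, double ghz)
{
    return ops_per_clock * vector_bytes * ghz;
}

/*
 * Fraction of the 256-bit clock speed that 512-bit code must sustain
 * to match 256-bit peak throughput for these ops:
 *   (3 * 32 bytes) / (2 * 64 bytes) = 96 / 128 = 0.75
 */
double break_even_clock_ratio(void)
{
    return (3.0 * 32.0) / (2.0 * 64.0);
}
```

Under this simplified model, 512-bit vectors still come out ahead per clock for such ops (128 vs. 96 bytes), as long as the frequency drop is smaller than 25%. Real code rarely runs at peak ALU throughput, so measuring is still the only reliable answer.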
