Why does Clang do this optimization trick only from Sandy Bridge onward?


I'm guessing that clang devs tested which uarches it was good on, and found it was only SnB-family.

That sounds right, because of a funky stall on P6-family, and AMD's different dividers.


On P6-family, using the flag result from a shift with an immediate count (as opposed to a shift-by-implicit-1) causes the front-end to stall before issuing the flag-reading instruction, until the shift retires. (That's because the P6 decoders don't check for the imm8=0 case, where a shift leaves flags unmodified, while SnB-family's do; see INC instruction vs ADD 1: Does it matter?.) That might be why clang doesn't use the trick on P6-family.

There's probably a different way of checking the relevant condition that wouldn't cause this stall (like a test rcx,rcx before the je), which would make the trick worth it on Core2/Nehalem. But if clang devs didn't realize why it was slow on P6-family, they wouldn't have thought to fix it, and just left the optimization disabled for pre-SnB targets. (Nobody added me to a patch review or bug CC list about this one, unfortunately; this is the first I've seen of clang doing this optimization. I think I might have mentioned shift flag stalls in comments on some other LLVM review or bug, though. Anyway, it might be fun to try adding a test and see whether that makes the trick worthwhile on Nehalem.)
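To make that concrete, here's a sketch of the trick and of the two ways the branch condition can be materialized. This is my reconstruction, not clang's exact output; the function name, label, and registers are mine:

```c
#include <stdint.h>

// Sketch of the optimization in question: branch on whether both 64-bit
// inputs actually fit in 32 bits, and use the much cheaper 32-bit divide
// when they do.
uint64_t div64_maybe_small(uint64_t a, uint64_t b) {
    if (((a | b) >> 32) == 0)                  // both high halves zero?
        return (uint32_t)a / (uint32_t)b;      // div r32: cheap on Intel
    return a / b;                              // div r64: expensive on Intel
}

// The check can be compiled two ways (hand-written illustration):
//
//   or   rax, rsi
//   shr  rax, 32        ; je reads flags written by a shift-by-imm8:
//   je   .Lsmall        ; front-end stall on P6-family
//
//   or   rax, rsi
//   shr  rax, 32
//   test rax, rax       ; flags now come from test, not the shift:
//   je   .Lsmall        ; no stall, at the cost of one extra uop
```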


According to Agner Fog, AMD's dividers have the same best-case div performance regardless of operand-size; it presumably depends only on the actual magnitude of the inputs. Only the worst case grows with operand-size. So I think it's harmless to run idiv r64 with small inputs sign-extended to 128 / 64-bit on AMD. (div/idiv on AMD is 2 uops for all operand sizes, except 8-bit where it's 1 uop because it only has to write one output register: AH and AL = AX. That's unlike Intel's microcoded integer division.)
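For reference, this is the kind of sign-extending sequence in question. The asm in the comments is illustrative of typical compiler output, not any particular compiler's exact code:

```c
#include <stdint.h>

// x86 idiv divides a double-width dividend, so the compiler first
// sign-extends RAX into RDX (cqo) or EAX into EDX (cdq):
//   64-bit: mov rax, rdi ; cqo ; idiv rsi    (RDX:RAX / RSI)
//   32-bit: mov eax, edi ; cdq ; idiv esi    (EDX:EAX / ESI)
// On AMD both forms cost about the same when the inputs are small;
// on Intel the 64-bit form is much more expensive either way.
int64_t sdiv64(int64_t a, int64_t b) { return a / b; }
int32_t sdiv32(int32_t a, int32_t b) { return a / b; }
```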

Intel is very different: on Haswell, idiv r32 is 9 uops vs. 59 uops for idiv r64, and the 64-bit version's best-case throughput is about 3x worse. Other members of the SnB-family are similar.
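If you want to see the difference yourself, here's a minimal throughput micro-benchmark sketch (my own illustration; the iteration count, input values, and use of a POSIX clock are all assumptions). The high bits of the 64-bit inputs are set on purpose, both to make the 64-bit divider do real work and to defeat clang's small-input fast path:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    const long iters = 100000000;
    // volatile divisors keep the compiler from turning the division
    // into a multiply-by-reciprocal.
    volatile uint32_t d32 = 123;
    volatile uint64_t d64 = 123;
    uint64_t sum = 0;

    double t0 = now();
    for (long i = 0; i < iters; i++)          // independent 32-bit divides
        sum += (0xdeadbeefu + (uint32_t)i) / d32;
    double t1 = now();
    for (long i = 0; i < iters; i++)          // independent 64-bit divides
        sum += (0xdeadbeefcafef00dULL + (uint64_t)i) / d64;
    double t2 = now();

    printf("32-bit: %.2f ns/div\n", (t1 - t0) / iters * 1e9);
    printf("64-bit: %.2f ns/div\n", (t2 - t1) / iters * 1e9);
    printf("checksum: %" PRIu64 "\n", sum);   // keep the loops live
    return 0;
}
```

Each iteration's divide depends only on the loop counter, not on the previous divide, so the loops are roughly throughput-bound rather than latency-bound.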

Why don't other compilers like GCC or ICC do it?

Probably because clang developers thought of it, and gcc/icc haven't copied them yet. If you've watched Chandler Carruth's talks about perf, one example he used was playing around with a branch to skip a div. I'm guessing this optimization was his idea. Looks nifty. :)
