Why is icc generating weird assembly for a simple main?

Question

I have a simple program:

int main()
{
    return 2*7;
}

Both GCC and clang, with optimizations turned on, happily generate a 2-instruction binary, but icc gives bizarre output:

     push      rbp                                           #2.1
     mov       rbp, rsp                                      #2.1
     and       rsp, -128                                     #2.1
     sub       rsp, 128                                      #2.1
     xor       esi, esi                                      #2.1
     mov       edi, 3                                        #2.1
     call      __intel_new_feature_proc_init                 #2.1
     stmxcsr   DWORD PTR [rsp]                               #2.1
     mov       eax, 14                                       #3.12
     or        DWORD PTR [rsp], 32832                        #2.1
     ldmxcsr   DWORD PTR [rsp]                               #2.1
     mov       rsp, rbp                                      #3.12
     pop       rbp                                           #3.12
     ret

Answer 1:

I don't know why ICC chooses to align the stack by 2 cache lines:

and       rsp, -128                                     #2.1
sub       rsp, 128                                      #2.1

That's interesting. The L2 cache has an adjacent-line prefetcher that likes to pull pairs of lines (in a 128-byte aligned group) into L2. But main's stack frame is not usually heavily used. Maybe important variables are allocated there in some programs. (This also explains setting up rbp: it saves the old RSP so the function can restore it before returning, after the AND has changed RSP by an unknown amount. gcc makes stack frames with RBP in functions where it aligns the stack, too.)
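As a rough illustration of what the and / sub pair does to the stack pointer, here is a minimal C++ sketch of the same arithmetic. The helper names align_down and rsp_after_prologue are made up for this example; they are not anything ICC emits or exposes.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors `and rsp, -128`: clearing the low 7 bits rounds the address
// down to a multiple of 128. As a 64-bit mask, -128 is the same as ~127.
constexpr std::uint64_t align_down(std::uint64_t address, std::uint64_t alignment) {
    return address & ~(alignment - 1);  // alignment must be a power of two
}

static_assert(static_cast<std::uint64_t>(-128) == ~std::uint64_t{127},
              "-128 and ~127 are the same bit pattern");

// Hypothetical helper: the value RSP would hold after the `and` / `sub` pair,
// given its value just after `push rbp`.
inline std::uint64_t rsp_after_prologue(std::uint64_t rsp) {
    constexpr std::uint64_t kAlign = 128;  // two 64-byte cache lines
    rsp = align_down(rsp, kAlign);         // and rsp, -128
    rsp -= kAlign;                         // sub rsp, 128
    assert(rsp % kAlign == 0);             // still 128-byte aligned
    return rsp;
}
```

Because the AND throws away an unknown number of low bits, the original RSP can't be recomputed from the aligned one, which is why the prologue keeps a copy in RBP.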

The rest is because main() is special, and ICC enables -ffast-math by default. (This is one of Intel's "dirty" little secrets, and lets it auto-vectorize more floating-point code out of the box.)

This includes adding code to the top of main to set the DAZ / FTZ bits in the MXCSR (SSE status / control register). See Intel's x86 manuals for more about these bits, but they're really not complicated:

DAZ: Denormals Are Zero: as inputs to an SSE/AVX instruction, denormals are treated as zero.
FTZ: Flush To Zero: When rounding the result of an SSE/AVX instruction, subnormal results are flushed to zero.
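To connect these two bits to the constant in the asm, here is a small sketch that decodes the 32832 from the or instruction. The constant and function names are invented for illustration; the bit positions (DAZ = bit 6, FTZ = bit 15) are the ones documented for MXCSR.

```cpp
#include <cstdint>

// MXCSR control bits (names here are illustrative, not from any header).
constexpr std::uint32_t kMxcsrDaz = 1u << 6;   // 0x0040: Denormals Are Zero
constexpr std::uint32_t kMxcsrFtz = 1u << 15;  // 0x8000: Flush To Zero

// The immediate in `or DWORD PTR [rsp], 32832` is exactly these two bits.
static_assert((kMxcsrFtz | kMxcsrDaz) == 32832, "32832 == 0x8040 == FTZ | DAZ");

// What the stmxcsr / or / ldmxcsr sequence computes: keep every other
// MXCSR bit as it was, and turn on FTZ and DAZ.
constexpr std::uint32_t enable_ftz_daz(std::uint32_t mxcsr) {
    return mxcsr | kMxcsrFtz | kMxcsrDaz;
}
```

On x86, the _mm_getcsr() and _mm_setcsr() intrinsics from <xmmintrin.h> correspond to stmxcsr and ldmxcsr, so something like _mm_setcsr(enable_ftz_daz(_mm_getcsr())) would be one way to do the same thing by hand.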

related: SSE "denormals are zeros" option

(ISO C++ forbids a program from calling back into main(), so compilers are allowed to put run-once stuff in main itself instead of in CRT startup files. gcc/clang with -ffast-math specified at link time link in CRT startup files that set the MXCSR. But at compile time, -ffast-math only affects gcc/clang code-gen in terms of which optimizations are allowed, i.e. treating FP add/mul as associative even though different temporaries (and thus different rounding) mean they really aren't. That part is totally unrelated to setting DAZ/FTZ.)
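To make the associativity point concrete, here is a minimal sketch showing that regrouping an FP sum can change the result. It assumes the file is built without -ffast-math (or an equivalent option), so the compiler has to keep the grouping as written; sum_both_ways is a made-up name.

```cpp
#include <utility>

// Returns {(a + b) + c, a + (b + c)}. Mathematically equal, but each
// intermediate sum is rounded, so the two groupings can differ.
inline std::pair<double, double> sum_both_ways(double a, double b, double c) {
    double left = (a + b) + c;
    double right = a + (b + c);
    return {left, right};
}
```

For example, with a = 1e20, b = -1e20, c = 1.0 the first grouping gives 1.0 (the big values cancel exactly first), while the second gives 0.0 (the 1.0 is lost when added to -1e20). A compiler that treats addition as associative is free to pick either.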

Denormal is being used as a synonym for subnormal here: an FP value with the minimum exponent and a significand where the implicit leading bit is 0 instead of 1, i.e. a nonzero value with magnitude smaller than FLT_MIN or DBL_MIN, the smallest representable normalized float/double.

https://en.wikipedia.org/wiki/Denormal_number
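As a quick way to check that definition from code, the standard library can classify a value directly. This is just a sketch using std::fpclassify; the wrapper name is_subnormal is made up for the example.

```cpp
#include <cmath>
#include <limits>

// True for nonzero values smaller in magnitude than FLT_MIN
// (the smallest normalized float).
inline bool is_subnormal(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL;
}

// Smallest positive subnormal and smallest positive normal float, for reference.
constexpr float kSmallestSubnormal = std::numeric_limits<float>::denorm_min();
constexpr float kSmallestNormal = std::numeric_limits<float>::min();  // == FLT_MIN
```

is_subnormal(kSmallestSubnormal) is true, while is_subnormal(kSmallestNormal) and is_subnormal(0.0f) are false: zero has the minimum exponent field too, but it is classified as FP_ZERO, not as a subnormal.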

Instructions that produce a subnormal result can be much slower: to optimize for latency, the fast path in some hardware assumes normalized results, and takes a microcode assist if the result can't be normalized. Use perf stat -e fp_assist.any to count such events.
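If you want something to point that perf counter at, a loop along these lines produces subnormal results on purpose. It is only a sketch: count_subnormal_halvings is a made-up helper, and it assumes the default FP environment (FTZ and DAZ both clear) and no -ffast-math.

```cpp
#include <cfloat>
#include <cmath>

// Halve FLT_MIN until it underflows to zero, counting how many of the
// intermediate results are subnormal.
inline int count_subnormal_halvings() {
    volatile float x = FLT_MIN;  // volatile keeps the multiplies at run time
    int subnormal_results = 0;
    while (x != 0.0f) {
        float halved = x * 0.5f;  // normal input, possibly subnormal result
        if (std::fpclassify(halved) == FP_SUBNORMAL) {
            ++subnormal_results;
        }
        x = halved;
    }
    return subnormal_results;
}
```

With gradual underflow this counts 23 subnormal results, one for each explicit significand bit of a float (2^-127 down to 2^-149); with FTZ set, the very first result would be flushed to zero instead. Calling it many times in a loop is what would make any assist penalty visible in perf.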

From Bruce Dawson's excellent series of FP articles: That’s Not Normal–the Performance of Odd Floats. Also:

Why does changing 0.1f to 0 slow down performance by 10x?
Avoiding denormal values in C++

Agner Fog has done some testing (see his microarch pdf), and reports for Haswell/Broadwell:

Underflow and subnormals

Subnormal numbers occur when floating point operations are close to underflow. The handling of subnormal numbers is very costly in some cases because the subnormal results are handled by microcode exceptions.

The Haswell and Broadwell have a penalty of approximately 124 clock cycles in all cases where an operation on normal numbers gives a subnormal result. There is a similar penalty for a multiplication between a normal and a subnormal number, regardless of whether the result is normal or subnormal. There is no penalty for adding a normal and a subnormal number, regardless of the result. There is no penalty for overflow, underflow, infinity or not-a-number results.

The penalties for subnormal numbers are avoided if the "flush-to-zero" mode and the "denormals-are-zero" mode are both set in the MXCSR register.

So in some cases, modern Intel CPUs avoid penalties even with subnormals, but

Source: https://stackoverflow.com/questions/52141947/why-is-icc-generating-weird-assembly-for-a-simple-main

Tags

c++

assembly

x86

code-generation

icc