After searching a long time for a performance bug, I read about denormal floating point values.
Apparently denormalized floating-point values can be a major performance problem.
To enable flush-to-zero (FTZ) with GCC, assuming the underflow exception is masked (which is the default):
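#define CSR_FLUSH_TO_ZERO (1 << 15)        /* FTZ bit of the SSE control/status register (MXCSR) */
unsigned csr = __builtin_ia32_stmxcsr();   /* read the current MXCSR value */
csr |= CSR_FLUSH_TO_ZERO;                  /* set the flush-to-zero bit */
__builtin_ia32_ldmxcsr(csr);               /* write the updated value back */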
In case it's not obvious from the names, __builtin_ia32_stmxcsr and __builtin_ia32_ldmxcsr are available only if you're targeting an x86 processor. ARM, SPARC, MIPS, etc. will each need separate platform-specific code with this approach.
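
To keep the x86-only part contained, the snippet above can be wrapped in a small guarded helper. This is only a sketch: the names enable_flush_to_zero and halve_smallest_normal are made up for illustration, and the guard assumes GCC or Clang targeting x86-64, where float arithmetic goes through SSE and so is affected by MXCSR. On any other target the helper just reports that it did nothing.

```c
#include <assert.h>
#include <float.h>

#if defined(__GNUC__) && defined(__x86_64__)
#define CSR_FLUSH_TO_ZERO (1u << 15)   /* FTZ bit of MXCSR */

/* Hypothetical helper: set the FTZ bit and return 1. */
static int enable_flush_to_zero(void)
{
    unsigned csr = __builtin_ia32_stmxcsr();
    csr |= CSR_FLUSH_TO_ZERO;
    __builtin_ia32_ldmxcsr(csr);
    /* Read the register back to confirm the bit is set. */
    assert(__builtin_ia32_stmxcsr() & CSR_FLUSH_TO_ZERO);
    return 1;
}
#else
/* Assumption: other platforms need their own code; report "not enabled". */
static int enable_flush_to_zero(void)
{
    return 0;
}
#endif

/* Hypothetical check: half of the smallest normal float is a denormal,
   so with FTZ active the result is flushed to zero instead.
   volatile keeps the compiler from folding the multiply at build time. */
static float halve_smallest_normal(void)
{
    volatile float x = FLT_MIN;
    volatile float y = x * 0.5f;
    return y;
}
```

Calling enable_flush_to_zero() once at startup and then halve_smallest_normal() is a quick way to see whether the setting took effect. Keep in mind that MXCSR is per-thread state, so each thread that does the heavy floating-point work needs to set it.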