What accounts for the added execution time of the first data set? The assembly instructions are the same.
With the DN_FLUSH flag not set, the first data set takes 63 m
Another quote from the Intel manuals, volume 1, chapter 10.2.3.3:
The flush-to-zero mode is not compatible with IEEE Standard 754. The IEEE mandated masked response to underflow is to deliver the denormalized result (see Section 4.8.3.2, “Normalized and Denormalized Finite Numbers”). The flush-to-zero mode is provided primarily for performance reasons. At the cost of a slight precision loss, faster execution can be achieved for applications where underflows are common and rounding the underflow result to zero can be tolerated.
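To check whether a data set actually contains denormals (the values that flush-to-zero would round to zero), something like the following minimal C sketch can be used. The helper names `is_denormal` and `count_denormals` are made up for illustration and are not from the Intel manuals or any library; the only assumption is a standard `<float.h>` providing `FLT_MIN`, the smallest positive normalized `float`.

```c
#include <float.h>   /* FLT_MIN: smallest positive normalized float */
#include <stddef.h>  /* size_t */

/* Hypothetical helper: returns 1 if x is a denormal (subnormal), i.e.
   nonzero but smaller in magnitude than the smallest normalized float. */
static int is_denormal(float x)
{
    return (x > 0.0f && x < FLT_MIN) || (x < 0.0f && x > -FLT_MIN);
}

/* Hypothetical helper: counts how many entries of data[0..n-1] are denormal.
   A high count suggests the slow path for denormals is being hit. */
static size_t count_denormals(const float *data, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        count += (size_t)is_denormal(data[i]);
    return count;
}
```

Running `count_denormals` over each data set (and over intermediate results, since denormals may only appear after a few operations) should show whether the slower set is the one producing underflows.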