Code1:
vzeroall
mov rcx, 1000000
startLabel1:
vfmadd231ps ymm0, ymm0, ymm0
vfmadd231ps ymm1, ymm1, ymm1
vfmadd231ps ymm2, ym
Update: the previous version contained 6 VPADDD instructions (vs. 5 in the question), and the extra VPADDD caused an imbalance on Broadwell. After this was fixed, Haswell, Broadwell, and Skylake issue almost the same number of uops to ports 0, 1, and 5.
There is no port contamination, but uops are scheduled suboptimally: on Broadwell the majority of uops go to Port 5, making it the bottleneck before Ports 0 and 1 are saturated.
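As a rough point of comparison, here is a minimal sketch of what a perfectly balanced schedule would look like for this loop body. It assumes (this is not stated in the answer, it comes from commonly published instruction tables for Haswell and Broadwell) that VFMADD231PS can issue on ports 0 and 1 and that the 256-bit VPADDD can issue on ports 1 and 5. The helper name `best_split` is hypothetical, and the loop's DEC/JNZ pair is ignored.

```python
def best_split(n_fma=10, n_add=5):
    """Brute-force the uop-to-port assignment with the smallest busiest port.

    Assumed port bindings (not taken from the answer):
      VFMADD231PS -> port 0 or 1
      VPADDD ymm  -> port 1 or 5
    """
    best = None
    for fma_on_p1 in range(n_fma + 1):
        for add_on_p1 in range(n_add + 1):
            load = {
                0: n_fma - fma_on_p1,        # FMAs left on port 0
                1: fma_on_p1 + add_on_p1,    # port 1 is shared by both kinds
                5: n_add - add_on_p1,        # VPADDDs left on port 5
            }
            if best is None or max(load.values()) < max(best.values()):
                best = load
    return best


if __name__ == "__main__":
    print(best_split())
```

Under these assumptions the only assignment that reaches the lower bound of 5 uops per port is the even one: 5 FMAs on port 0, 5 FMAs on port 1, and all 5 VPADDDs on port 5. Any VPADDD that lands on port 1, or any uneven FMA split, pushes one port past 5 uops per iteration.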
To demonstrate what is going on, I suggest (ab)using the demo on PeachPy.IO:
Open www.peachpy.io in Google Chrome (it won't work in other browsers).
Replace the default code (which implements the SDOT function) with the code below, which is literally your example ported to PeachPy syntax:
n = Argument(size_t)
x = Argument(ptr(const_float_))
incx = Argument(size_t)
y = Argument(ptr(const_float_))
incy = Argument(size_t)

with Function("sdot", (n, x, incx, y, incy)) as function:
    reg_n = GeneralPurposeRegister64()
    LOAD.ARGUMENT(reg_n, n)

    VZEROALL()
    with Loop() as loop:
        for i in range(15):
            ymm_i = YMMRegister(i)
            if i < 10:
                VFMADD231PS(ymm_i, ymm_i, ymm_i)
            else:
                VPADDD(ymm_i, ymm_i, ymm_i)
        DEC(reg_n)
        JNZ(loop.begin)

    RETURN()
I have a number of machines on different microarchitectures as a backend for PeachPy.io. Choose Intel Haswell, Intel Broadwell, or Intel Skylake and press "Quick Run". The system will compile your code, upload it to the server, and visualize the performance counters collected during execution.
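If you just want to see which instructions the PeachPy loop above expands to, the same unrolling logic can be mirrored in plain Python with no PeachPy dependency. This is only an illustrative sketch: the helper name `loop_body` is hypothetical, and the `rcx` counter and `startLabel1` label are borrowed from the question's assembly rather than from what PeachPy's register allocator would actually pick.

```python
def loop_body(n_regs=15, n_fma=10, counter="rcx", label="startLabel1"):
    """List the instructions of one loop iteration as assembly-like text.

    Mirrors the PeachPy generator: the first n_fma registers get an FMA,
    the remaining ones get a VPADDD, followed by the loop-control pair.
    """
    lines = []
    for i in range(n_regs):
        op = "vfmadd231ps" if i < n_fma else "vpaddd"
        lines.append(f"{op} ymm{i}, ymm{i}, ymm{i}")
    # Loop control; the register and label names are assumptions.
    lines.append(f"dec {counter}")
    lines.append(f"jnz {label}")
    return lines


if __name__ == "__main__":
    print("\n".join(loop_body()))
```

With the default arguments this lists 10 VFMADD231PS and 5 VPADDD instructions, each operating on its own YMM register, followed by the decrement and the conditional jump.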
Here is the uops distribution over execution ports on Intel Haswell: