Significant FMA performance anomaly experienced in the Intel Broadwell processor

北海茫月 2020-12-05 00:38
  • Code1:

    vzeroall
    mov             rcx, 1000000
    startLabel1:
    vfmadd231ps     ymm0, ymm0, ymm0
    vfmadd231ps     ymm1, ymm1, ymm1
    vfmadd231ps     ymm2, ym

2 Answers
  •  一生所求
    2020-12-05 00:51

    Update: the previous version contained 6 VPADDD instructions (vs. 5 in the question), and the extra VPADDD caused an imbalance on Broadwell. After that was fixed, Haswell, Broadwell, and Skylake issue almost the same number of uops to ports 0, 1, and 5.

    There is no port contamination, but uops are scheduled suboptimally: on Broadwell the majority of uops go to port 5, which makes it the bottleneck before ports 0 and 1 are saturated.
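
    To make the bottleneck argument concrete, here is a minimal Python sketch. It assumes each execution port accepts at most one uop per cycle, so the busiest port sets a lower bound on cycles per loop iteration. The helper name `bottleneck` and the per-port counts are hypothetical illustrations, not measured counter values.

```python
def bottleneck(port_uops):
    """Return (busiest_ports, cycles) for per-iteration uop counts per port.

    Assumption: one uop per port per cycle, so the largest count is a
    lower bound on the cycles needed for one loop iteration.
    """
    peak = max(port_uops.values())
    busiest = sorted(port for port, uops in port_uops.items() if uops == peak)
    return busiest, peak


# Hypothetical distributions (illustrative only, not measured data).
balanced = {"p0": 5, "p1": 5, "p5": 5}   # 15 uops spread evenly
imbalanced = {"p0": 5, "p1": 5, "p5": 6} # one extra uop lands on port 5

print(bottleneck(balanced))
print(bottleneck(imbalanced))
```

    In the imbalanced case port 5 alone determines the bound, even though ports 0 and 1 still have a free slot in every iteration.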

    To demonstrate what is going on, I suggest (ab)using the demo on PeachPy.IO:

    1. Open www.peachpy.io in Google Chrome (it won't work in other browsers).

    2. Replace the default code (which implements the SDOT function) with the code below, which is literally your example ported to PeachPy syntax:

      n = Argument(size_t)
      x = Argument(ptr(const_float_))
      incx = Argument(size_t)
      y = Argument(ptr(const_float_))
      incy = Argument(size_t)
      
      with Function("sdot", (n, x, incx, y, incy)) as function:
          reg_n = GeneralPurposeRegister64()
          LOAD.ARGUMENT(reg_n, n)
      
          VZEROALL()
      
          with Loop() as loop:
              for i in range(15):
                  ymm_i = YMMRegister(i)
                  if i < 10:
                      VFMADD231PS(ymm_i, ymm_i, ymm_i)
                  else:
                      VPADDD(ymm_i, ymm_i, ymm_i)
              DEC(reg_n)
              JNZ(loop.begin)
      
          RETURN()
      
    3. I have a number of machines on different microarchitectures as a backend for PeachPy.IO. Choose Intel Haswell, Intel Broadwell, or Intel Skylake and press "Quick Run". The system will compile your code, upload it to the server, and visualize the performance counters collected during execution.

    4. Here is the uops distribution over execution ports on Intel Haswell:

    5. And here is the same plot from Intel Broadwell:

    6. Apparently, whatever the flaw in the uop scheduler was, it was fixed in Intel Skylake, because port pressure on that machine is the same as on Haswell.
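
    As a side note on the PeachPy code in step 2: the `for i in range(15)` loop runs at code-generation time, so the emitted loop body is fully unrolled. The plain-Python sketch below mirrors that loop without depending on PeachPy and just lists the instruction text it would produce. The function name `loop_body_mnemonics` is hypothetical, and the output is text for inspection, not machine code.

```python
def loop_body_mnemonics(total=15, fma_count=10):
    """Mirror the generator loop: the first `fma_count` registers get an FMA,
    the remaining ones get a VPADDD. Returns instruction text only."""
    body = []
    for i in range(total):
        mnemonic = "VFMADD231PS" if i < fma_count else "VPADDD"
        reg = f"ymm{i}"
        body.append(f"{mnemonic} {reg}, {reg}, {reg}")
    return body


body = loop_body_mnemonics()
fma = sum(line.startswith("VFMADD231PS") for line in body)
padd = sum(line.startswith("VPADDD") for line in body)
print(f"{fma} FMA + {padd} VPADDD per iteration")
for line in body:
    print(line)
```

    This makes it easy to check the instruction mix (10 FMA and 5 VPADDD, as in the question) before submitting the real code to a backend.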
