C++: Mysteriously huge speedup from keeping one operand in a register

被撕碎了的回忆 2020-12-22 19:39

I have been trying to get an idea of the impact of having an array in L1 cache versus memory by timing a routine that scales and sums the elements of an array using the foll

3 answers
  •  眼角桃花
    2020-12-22 20:33

    I would surmise the problem isn't in the cache/memory access but in the processor (execution of your code). There are several visible bottlenecks here.

    The performance numbers here are based on the boxes I was using (either Sandy Bridge or Westmere).

    Peak performance for scalar math is 2.7 GHz x 2 FLOPs/clock = 5.4 GFLOPS, since the processor can do an add and a multiply simultaneously. The theoretical efficiency of the code is 0.6/(2.7*2) = 11%.

    Bandwidth needed: each (+) and (x) pair consumes one 8-byte double, so 4 bytes/FLOP. 4 bytes/FLOP * 5.4 GFLOPS = 21.6 GB/s.

    If you know the data was read recently, it's likely in L1 (89 GB/s), L2 (42 GB/s) or L3 (24 GB/s), so we can rule out cache bandwidth as the bottleneck.

    The memory subsystem delivers 18.9 GB/s, so even with the data in main memory, performance should approach 18.9/21.6 = 87.5% of peak.
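
    The arithmetic above can be written out as compile-time constants. This is only a sketch of the estimate: the constant names are made up for illustration, the figures are the ones quoted in this answer, and the 4 bytes/FLOP value assumes each loaded 8-byte double feeds one multiply and one add.

    ```cpp
    // Back-of-the-envelope numbers from the answer, as compile-time constants.
    // All names are illustrative; the figures are the ones quoted in the text.
    constexpr double clock_ghz       = 2.7;  // core clock
    constexpr double flops_per_clock = 2.0;  // one add + one multiply per cycle
    constexpr double peak_gflops     = clock_ghz * flops_per_clock;   // 5.4

    // Assumption: one 8-byte double is loaded per multiply-add pair (2 FLOPs).
    constexpr double bytes_per_flop  = 8.0 / 2.0;
    constexpr double needed_gb_s     = bytes_per_flop * peak_gflops;  // 21.6

    constexpr double memory_gb_s     = 18.9; // quoted main-memory bandwidth
    constexpr double memory_bound_fraction = memory_gb_s / needed_gb_s;  // 0.875

    constexpr double measured_gflops = 0.6;  // figure used in the efficiency estimate
    constexpr double efficiency      = measured_gflops / peak_gflops;    // about 0.11
    ```

    Since the measured rate is far below even the main-memory bound, the estimate points at the execution core rather than at the caches or memory.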

    • May want to batch up requests (via unrolling) as early as possible

    Even with speculative execution, in tot += a * X[i] the adds will be serialized, because tot(n) needs to be evaluated before tot(n+1) can be kicked off.

    First, unroll the loop:
    advance i by 8 each iteration and do

    {//your func
        for( int i = 0; i < size; i += 8 ){
            tot += a * X[i];
            tot += a * X[i+1];
            ...
            tot += a * X[i+7];
        }
        return tot;
    }
    

    Use multiple accumulators
    This will break dependencies and allow us to avoid stalling on the addition pipeline

    {//your func//
        // accumulators must be double: int would truncate every product
        double tot,tot2,tot3,tot4;
        tot = tot2 = tot3 = tot4 = 0;
        for( int i = 0; i < size; i += 8 ){
            tot  += a * X[i];
            tot2 += a * X[i+1];
            tot3 += a * X[i+2];
            tot4 += a * X[i+3];
            tot  += a * X[i+4];
            tot2 += a * X[i+5];
            tot3 += a * X[i+6];
            tot4 += a * X[i+7];
        }
        return tot + tot2 + tot3 + tot4;
    }
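
    The snippets above assume that size is a multiple of 8. A minimal self-contained sketch that also handles the leftover elements might look like the following. The question's function is truncated, so the name scaled_sum and its signature are assumptions, not the original code. Note that splitting the sum across accumulators changes the order of the floating-point additions, so the result can differ from the single-accumulator version in the last bits.

    ```cpp
    #include <cassert>
    #include <cstddef>

    // Hypothetical signature: the name and parameter types are assumptions.
    double scaled_sum(double a, const double* X, std::size_t size) {
        assert(X != nullptr || size == 0);

        // Four independent accumulators, so consecutive adds do not have to
        // wait on each other's results.
        double tot0 = 0.0, tot1 = 0.0, tot2 = 0.0, tot3 = 0.0;

        std::size_t i = 0;
        // Main loop: 8 elements per iteration, two per accumulator.
        for (; i + 8 <= size; i += 8) {
            tot0 += a * X[i];
            tot1 += a * X[i + 1];
            tot2 += a * X[i + 2];
            tot3 += a * X[i + 3];
            tot0 += a * X[i + 4];
            tot1 += a * X[i + 5];
            tot2 += a * X[i + 6];
            tot3 += a * X[i + 7];
        }
        // Tail loop: the 0 to 7 elements left when size is not a multiple of 8.
        for (; i < size; ++i) {
            tot0 += a * X[i];
        }
        return (tot0 + tot1) + (tot2 + tot3);
    }
    ```

    For example, with X holding 1.0 through 19.0, scaled_sum(2.0, X, 19) runs the main loop twice and the tail loop three times.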
    

    UPDATE: After running this on a Sandy Bridge box I have access to (2.7 GHz Sandy Bridge with -O2 -march=native -mtune=native):

    Original code:

    Operand size: 2048  
    Vector size 2048: mflops=2206.2, result=61.8  
    2.206 / 5.4 = 40.9%
    

    Improved Code:

    Operand size: 2048  
    Vector size 2048: mflops=5313.7, result=61.8  
    5.3137 / 5.4 = 98.4%  
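
    The harness that printed these lines is not shown here, so the following is only a sketch of how an MFLOPS figure of this kind could be measured, counting 2 FLOPs (one multiply, one add) per element. The names scaled_sum_simple, BenchResult and run_benchmark are hypothetical, and the numbers it produces will depend on the machine and compiler flags.

    ```cpp
    #include <cassert>
    #include <chrono>
    #include <cstddef>
    #include <vector>

    // Hypothetical stand-in for the routine under test.
    double scaled_sum_simple(double a, const double* X, std::size_t size) {
        double tot = 0.0;
        for (std::size_t i = 0; i < size; ++i) tot += a * X[i];
        return tot;
    }

    struct BenchResult {
        double mflops;  // millions of floating-point operations per second
        double result;  // accumulated sum, returned so the work is not discarded
    };

    // Times `repeats` passes over a vector of `size` doubles.
    BenchResult run_benchmark(std::size_t size, std::size_t repeats) {
        assert(size > 0 && repeats > 0);
        std::vector<double> X(size, 1.0);
        double total = 0.0;

        const auto start = std::chrono::steady_clock::now();
        for (std::size_t r = 0; r < repeats; ++r) {
            // Vary the scale factor slightly so the passes are not identical.
            const double a = 1.0 + 1e-9 * static_cast<double>(r);
            total += scaled_sum_simple(a, X.data(), size);
        }
        const auto stop = std::chrono::steady_clock::now();

        const double seconds = std::chrono::duration<double>(stop - start).count();
        // 2 FLOPs per element: one multiply and one add.
        const double flops = 2.0 * static_cast<double>(size) * static_cast<double>(repeats);
        return {flops / seconds / 1e6, total};
    }
    ```

    Dividing the reported mflops by 5400 (the 5.4 GFLOPS scalar peak) gives an efficiency figure comparable to the percentages above.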
    
