Is there a really working example showing the benefits of ILP (Instruction-Level Parallelism) on x86_64?

无人共我 2020-12-20 17:16

As is known, the CPU is pipelined, and it works most efficiently when the instructions in a sequence are independent of each other - this is known as ILP (Instruction-Level Parallelism): http://e

2 Answers
  • 2020-12-20 17:40

    On most Intel processors, a floating-point add takes 3 cycles (latency). But the core can sustain up to 1 add per cycle (throughput) if the adds are independent.

    We can easily demonstrate the effect by comparing a loop whose floating-point adds all sit on the critical path against one whose adds are independent.


    Environment:

    • GCC 4.8.2: -O2
    • Sandy Bridge Xeon

    Make sure that the compiler does not do unsafe floating-point optimizations (e.g. do not pass -ffast-math or -funsafe-math-optimizations), otherwise it may re-associate or vectorize the sums and destroy the dependency chain being measured.

    #include <iostream>
    using namespace std;
    
    #include <time.h>
    
    const int iterations = 1000000000;
    
    double sequential(){
        double a = 2.3;
        double result = 0;
    
        for (int c = 0; c < iterations; c += 4){
            //  Every add depends on the previous add. No ILP is possible.
            result += a;
            result += a;
            result += a;
            result += a;
        }
    
        return result;
    }
    double optimized(){
        double a = 2.3;
        double result0 = 0;
        double result1 = 0;
        double result2 = 0;
        double result3 = 0;
    
        for (int c = 0; c < iterations; c += 4){
            //  4 independent adds. Up to 4 adds can be run in parallel.
            result0 += a;
            result1 += a;
            result2 += a;
            result3 += a;
        }
    
        return result0 + result1 + result2 + result3;
    }
    
    int main(){
    
        clock_t start0 = clock();
        double sum0 = sequential();
        clock_t end0 = clock();
        cout << "sum = " << sum0 << endl;
        cout << "sequential time: " << (double)(end0 - start0) / CLOCKS_PER_SEC << endl;
    
        clock_t start1 = clock();
        double sum1 = optimized();
        clock_t end1 = clock();
        cout << "sum = " << sum1 << endl;
        cout << "optimized time:  " << (double)(end1 - start1) / CLOCKS_PER_SEC << endl;
    
    }
    

    Output:

    sum = 2.3e+09
    sequential time: 0.948138
    sum = 2.3e+09
    optimized time:  0.317293
    

    Notice how the difference is almost exactly 3x. That's because of the 3-cycle latency and 1-cycle throughput of the floating-point add.

    The sequential version has very little ILP because all the floating-point adds are on the critical path: each add has to wait until the previous add is done. The unrolled version has 4 separate dependency chains with up to 4 independent adds - all of which can run in parallel. Only 3 independent chains are needed to saturate the processor core (3-cycle latency x 1 add/cycle throughput).
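
    As a rough sanity check (the exact clock of this Sandy Bridge Xeon is not given; ~3.2 GHz is assumed here), the timings line up with the latency-vs-throughput picture:

    sequential: 10^9 adds, latency-bound at 3 cycles/add    ->  ~3 x 10^9 cycles  ->  ~0.94 s at ~3.2 GHz
    optimized:  10^9 adds, throughput-bound at 1 add/cycle  ->  ~1 x 10^9 cycles  ->  ~0.31 s at ~3.2 GHz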

  • 2020-12-20 17:43

    The difference can also be made visible with integer code, for example:

    global cmp1
    proc_frame cmp1
    [endprolog]
        mov ecx, -10000000          ; counts up to zero: 10 million iterations
        mov r8d, 1
        xor eax, eax                ; single accumulator
    _cmp1_loop:
        add eax, r8d                ; each add depends on the previous one - no ILP
        add eax, r8d
        add eax, r8d
        add eax, r8d
        add ecx, 1
        jnz _cmp1_loop
        ret
    endproc_frame
    
    global cmp2
    proc_frame cmp2
    [endprolog]
        mov ecx, -10000000          ; same iteration count
        mov r8d, 1
        xor eax, eax                ; four independent accumulators
        xor edx, edx
        xor r9d, r9d
        xor r10d, r10d
    _cmp2_loop:
        add eax, r8d                ; the four adds are independent - they can execute in parallel
        add edx, r8d
        add r9d, r8d
        add r10d, r8d
        add ecx, 1
        jnz _cmp2_loop
        add r9d, r10d               ; combine the partial sums
        add eax, edx
        add eax, r9d
        ret
    endproc_frame
    
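    As a rough sketch, the two routines above could be timed with a small C++ driver along these lines. It assumes cmp1 and cmp2 are assembled into an object file the C++ compiler can link against, and it uses the compiler's __rdtsc intrinsic (<x86intrin.h> on GCC/Clang, <intrin.h> on MSVC), keeping the minimum over 1000 runs as in the results below:

    #include <iostream>
    #include <cstdint>
    #include <x86intrin.h>   // __rdtsc (use <intrin.h> with MSVC)
    
    extern "C" int cmp1();   // single-accumulator version above
    extern "C" int cmp2();   // four-accumulator version above
    
    static uint64_t min_ticks(int (*fn)(), int runs = 1000){
        uint64_t best = UINT64_MAX;
        for (int i = 0; i < runs; i++){
            uint64_t t0 = __rdtsc();
            volatile int sink = fn();   // volatile keeps the call from being optimized away
            (void)sink;
            uint64_t t1 = __rdtsc();
            if (t1 - t0 < best) best = t1 - t0;
        }
        return best;
    }
    
    int main(){
        std::cout << "cmp1: " << min_ticks(cmp1) << " TSC ticks" << std::endl;
        std::cout << "cmp2: " << min_ticks(cmp2) << " TSC ticks" << std::endl;
    }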

    Results on my 4770K are about 35.9 million TSC ticks for the first one vs 11.9 million for the second one (minimum time over 1k runs).

    In the first one, the dependency chain on eax is the slowest thing, at 4 cycles per iteration. Nothing else matters: the dependency chain on ecx is faster, and there is plenty of throughput to hide it and the control flow. By the way, 35.9 million TSC ticks works out to about 40 million core cycles, since the TSC ticks at the base clock of 3.5 GHz while the core runs at the max turbo of 3.9 GHz: 3.9/3.5 * 35.9 is about 40.

    The version of the second one I mentioned in the comments (4 accumulators, but using [rsp] to store the constant 1) takes 17.9 million TSC ticks, which works out to 2 cycles per iteration. That matches the throughput of the memory loads, which on Haswell is 2 per cycle: 4 loads, so 2 cycles. The loop overhead can still be hidden.

    The second one as posted above takes 1.3333 cycles per iteration. The first four adds can go to ports 0, 1, 5 and 6, while the fused add/jnz pair can go only to port 6. Putting the fused pair on p6 leaves 3 ports for the 4 add µops, hence 1.3333 cycles per iteration.
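
    Cross-checking both variants with the same TSC-to-core-cycle conversion (3.5 GHz base clock vs. 3.9 GHz turbo):

    [rsp] variant:   10,000,000 iterations x 2 cycles      = 20.0 million core cycles  ->  x 3.5/3.9  ≈  17.9 million TSC ticks
    cmp2 as posted:  10,000,000 iterations x 1.3333 cycles ≈ 13.3 million core cycles  ->  x 3.5/3.9  ≈  12.0 million TSC ticks  (measured: ~11.9 million)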
