Is there a working example showing the benefits of ILP (Instruction-Level Parallelism) on x86_64?

无人共我 2020-12-20 17:16

As is known, the CPU is pipelined, and it works most efficiently when the instructions in a sequence are independent of each other - this is known as ILP (Instruction-Level Parallelism): http://e

2 Answers
  •  慢半拍i
     2020-12-20 17:40

    On most Intel processors, it takes 3 cycles to do a floating-point add. But it can sustain up to 1/cycle if they are independent.

    We can easily demonstrate ILP by comparing floating-point adds that all sit on a single critical path against adds that are independent of each other.


    Environment:

    • GCC 4.8.2: -O2
    • Sandy Bridge Xeon

    Make sure that the compiler does not do unsafe floating-point optimizations.
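
    For example (not part of the original answer; the file name is hypothetical), a plain -O2 build is enough, while the "fast math" options must be avoided because they let the compiler re-associate and vectorize the adds, collapsing the very dependency chains this test relies on:

    g++ -O2 main.cpp -o ilp_demo      # OK: strict FP semantics are preserved
    # -Ofast or -ffast-math would allow re-association and defeat the experiment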

    #include <iostream>
    using namespace std;
    
    #include <time.h>
    
    const int iterations = 1000000000;
    
    double sequential(){
        double a = 2.3;
        double result = 0;
    
        for (int c = 0; c < iterations; c += 4){
            //  Every add depends on the previous add. No ILP is possible.
            result += a;
            result += a;
            result += a;
            result += a;
        }
    
        return result;
    }
    double optimized(){
        double a = 2.3;
        double result0 = 0;
        double result1 = 0;
        double result2 = 0;
        double result3 = 0;
    
        for (int c = 0; c < iterations; c += 4){
            //  4 independent adds. Up to 4 adds can be run in parallel.
            result0 += a;
            result1 += a;
            result2 += a;
            result3 += a;
        }
    
        return result0 + result1 + result2 + result3;
    }
    
    int main(){
    
        clock_t start0 = clock();
        double sum0 = sequential();
        clock_t end0 = clock();
        cout << "sum = " << sum0 << endl;
        cout << "sequential time: " << (double)(end0 - start0) / CLOCKS_PER_SEC << endl;
    
        clock_t start1 = clock();
        double sum1 = optimized();
        clock_t end1 = clock();
        cout << "sum = " << sum1 << endl;
        cout << "optimized time:  " << (double)(end1 - start1) / CLOCKS_PER_SEC << endl;
    
    }
    

    Output:

    sum = 2.3e+09
    sequential time: 0.948138
    sum = 2.3e+09
    optimized time:  0.317293
    

    Notice how the difference is almost exactly 3x. That's because of the 3-cycle latency and 1-cycle throughput of the floating-point add.
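
    As a rough sanity check (the clock speed is an assumption, since the answer does not state it), at about 3.2 GHz:

    0.948 s * 3.2e9 cycles/s / 1e9 adds ≈ 3.0 cycles per add   (latency-bound)
    0.317 s * 3.2e9 cycles/s / 1e9 adds ≈ 1.0 cycles per add   (throughput-bound)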

    The sequential version has very little ILP because all the floating-point adds are on the same critical path: each add has to wait until the previous add is done. The unrolled version has 4 separate dependency chains with up to 4 independent adds - all of which can run in parallel. Only 3 are required to saturate the processor core (3-cycle latency x 1 add/cycle throughput = 3 adds in flight).
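
    A possible follow-up experiment (my sketch, not from the original answer; it assumes the compiler keeps the small accumulator array in registers, which is worth checking in the generated assembly): run the same loop with N independent chains and watch the time stop improving once N reaches the latency/throughput ratio of 3.

    #include <iostream>
    using namespace std;
    
    #include <time.h>
    
    const int iterations = 1000000000;
    
    //  N independent dependency chains; the total number of adds is the same for every N.
    template <int N>
    double chains(){
        double a = 2.3;
        double acc[N] = {};   //  ideally scalarized into N registers by the compiler
    
        for (int c = 0; c < iterations; c += N){
            for (int i = 0; i < N; i++)
                acc[i] += a;
        }
    
        double result = 0;
        for (int i = 0; i < N; i++)
            result += acc[i];
        return result;
    }
    
    template <int N>
    void run(const char *label){
        clock_t start = clock();
        double sum = chains<N>();
        clock_t end = clock();
        cout << label << ": sum = " << sum
             << ", time = " << (double)(end - start) / CLOCKS_PER_SEC << endl;
    }
    
    int main(){
        run<1>("1 chain ");
        run<2>("2 chains");
        run<4>("4 chains");   //  4 chains should already be enough to hide the 3-cycle latency
    }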
