Is there a working example showing the benefits of ILP (Instruction-Level Parallelism) on x86_64?

无人共我 2020-12-20 17:16

As is known, the CPU is pipelined, and it works most efficiently when the instructions in a sequence are independent of each other - this is known as ILP (Instruction-Level Parallelism): http://e

2 Answers
  •  慢半拍i
     2020-12-20 17:40

    On most Intel processors, it takes 3 cycles to do a floating-point add. But it can sustain up to 1/cycle if they are independent.

    We can easily demonstrate ILP by comparing floating-point adds that all sit on a single critical path against adds that are independent of each other.


    Environment:

    • GCC 4.8.2: -O2
    • Sandy Bridge Xeon

    Make sure that the compiler does not do unsafe floating-point optimizations.
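
    For example (not part of the original answer; the file name is hypothetical), a plain -O2 build is enough, while the "fast math" options must be avoided because they let the compiler re-associate and vectorize the adds, collapsing the very dependency chains this test relies on:

    g++ -O2 main.cpp -o ilp_demo      # OK: strict FP semantics are preserved
    # -Ofast or -ffast-math would allow re-association and defeat the experiment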

    #include <iostream>
    using namespace std;
    
    #include <time.h>
    
    const int iterations = 1000000000;
    
    double sequential(){
        double a = 2.3;
        double result = 0;
    
        for (int c = 0; c < iterations; c += 4){
            //  Every add depends on the previous add. No ILP is possible.
            result += a;
            result += a;
            result += a;
            result += a;
        }
    
        return result;
    }
    double optimized(){
        double a = 2.3;
        double result0 = 0;
        double result1 = 0;
        double result2 = 0;
        double result3 = 0;
    
        for (int c = 0; c < iterations; c += 4){
            //  4 independent adds. Up to 4 adds can be run in parallel.
            result0 += a;
            result1 += a;
            result2 += a;
            result3 += a;
        }
    
        return result0 + result1 + result2 + result3;
    }
    
    int main(){
    
        clock_t start0 = clock();
        double sum0 = sequential();
        clock_t end0 = clock();
        cout << "sum = " << sum0 << endl;
        cout << "sequential time: " << (double)(end0 - start0) / CLOCKS_PER_SEC << endl;
    
        clock_t start1 = clock();
        double sum1 = optimized();
        clock_t end1 = clock();
        cout << "sum = " << sum1 << endl;
        cout << "optimized time:  " << (double)(end1 - start1) / CLOCKS_PER_SEC << endl;
    
    }
    

    Output:

    sum = 2.3e+09
    sequential time: 0.948138
    sum = 2.3e+09
    optimized time:  0.317293
    

    Notice how the difference is almost exactly 3x. That's because of the 3-cycle latency and 1-cycle throughput of the floating-point add.
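
    As a rough sanity check (the clock speed is an assumption, since the answer does not state it), at about 3.2 GHz:

    0.948 s * 3.2e9 cycles/s / 1e9 adds ≈ 3.0 cycles per add   (latency-bound)
    0.317 s * 3.2e9 cycles/s / 1e9 adds ≈ 1.0 cycles per add   (throughput-bound)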

    The sequential version has very little ILP because all the floating-point adds are on the same critical path: each add has to wait until the previous add is done. The unrolled version has 4 separate dependency chains with up to 4 independent adds - all of which can run in parallel. Only 3 are required to saturate the processor core (3-cycle latency x 1 add/cycle throughput = 3 adds in flight).
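
    A possible follow-up experiment (my sketch, not from the original answer; it assumes the compiler keeps the small accumulator array in registers, which is worth checking in the generated assembly): run the same loop with N independent chains and watch the time stop improving once N reaches the latency/throughput ratio of 3.

    #include <iostream>
    using namespace std;
    
    #include <time.h>
    
    const int iterations = 1000000000;
    
    //  N independent dependency chains; the total number of adds is the same for every N.
    template <int N>
    double chains(){
        double a = 2.3;
        double acc[N] = {};   //  ideally scalarized into N registers by the compiler
    
        for (int c = 0; c < iterations; c += N){
            for (int i = 0; i < N; i++)
                acc[i] += a;
        }
    
        double result = 0;
        for (int i = 0; i < N; i++)
            result += acc[i];
        return result;
    }
    
    template <int N>
    void run(const char *label){
        clock_t start = clock();
        double sum = chains<N>();
        clock_t end = clock();
        cout << label << ": sum = " << sum
             << ", time = " << (double)(end - start) / CLOCKS_PER_SEC << endl;
    }
    
    int main(){
        run<1>("1 chain ");
        run<2>("2 chains");
        run<4>("4 chains");   //  4 chains should already be enough to hide the 3-cycle latency
    }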
