As is well known, CPUs are pipelined, and a pipeline works most efficiently when the instructions in a sequence are independent of each other. This is known as ILP (Instruction-Level Parallelism): http://e
On most Intel processors, a floating-point add has a latency of 3 cycles. But the processor can sustain a throughput of up to one add per cycle if the adds are independent of each other.
We can easily demonstrate ILP by putting a floating-point add on the critical path.
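Environment:
-O2
Make sure that the compiler does not perform unsafe floating-point optimizations (for example, GCC's -ffast-math), since reassociating the adds would change the dependency chains being measured.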
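To see why such optimizations count as unsafe, recall that floating-point addition is not associative, so regrouping a sum can change its result. The following is a minimal sketch of that effect; the function names (`sum_left`, `sum_right`, `show_non_associativity`) are made up for illustration and are not part of the benchmark.

```cpp
#include <iostream>

// Adds three doubles grouped as (a + b) + c.
constexpr double sum_left(double a, double b, double c) { return (a + b) + c; }

// Adds the same three doubles grouped as a + (b + c).
constexpr double sum_right(double a, double b, double c) { return a + (b + c); }

// Prints both groupings for inputs chosen so that the rounding differs.
void show_non_associativity() {
    const double big = 1e20;  // far larger than 2^53, so adding 1.0 to it is lost to rounding
    std::cout << sum_left(big, -big, 1.0) << '\n';   // (big - big) + 1 prints 1
    std::cout << sum_right(big, -big, 1.0) << '\n';  // big + (-big + 1) prints 0
}
```

Because the two groupings can disagree, a compiler that follows strict floating-point rules has to keep the adds in the order they were written. That is what keeps the single dependency chain in the sequential version intact.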
#include <iostream>
#include <ctime>
using namespace std;
const int iterations = 1000000000;
double sequential(){
    double a = 2.3;
    double result = 0;
    for (int c = 0; c < iterations; c += 4){
        // Every add depends on the previous add. No ILP is possible.
        result += a;
        result += a;
        result += a;
        result += a;
    }
    return result;
}
double optimized(){
    double a = 2.3;
    double result0 = 0;
    double result1 = 0;
    double result2 = 0;
    double result3 = 0;
    for (int c = 0; c < iterations; c += 4){
        // 4 independent adds. Up to 4 adds can be run in parallel.
        result0 += a;
        result1 += a;
        result2 += a;
        result3 += a;
    }
    return result0 + result1 + result2 + result3;
}
int main(){
    clock_t start0 = clock();
    double sum0 = sequential();
    clock_t end0 = clock();
    cout << "sum = " << sum0 << endl;
    cout << "sequential time: " << (double)(end0 - start0) / CLOCKS_PER_SEC << endl;
    clock_t start1 = clock();
    double sum1 = optimized();
    clock_t end1 = clock();
    cout << "sum = " << sum1 << endl;
    cout << "optimized time: " << (double)(end1 - start1) / CLOCKS_PER_SEC << endl;
}
Output:
sum = 2.3e+09
sequential time: 0.948138
sum = 2.3e+09
optimized time: 0.317293
Notice how the difference is almost exactly 3x (0.948138 / 0.317293 ≈ 2.99). That's because of the 3-cycle latency and one-per-cycle throughput of the floating-point add.
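The arithmetic behind that ratio can be written as a small back-of-the-envelope model. This is only a sketch: it assumes the floating-point adder is the sole bottleneck and ignores loop overhead, and the names `cycles_per_add` and `predicted_speedup` are hypothetical helpers, not part of the benchmark.

```cpp
// Estimated cycles per add when `chains` independent dependency chains are
// interleaved. Each chain can start a new add only every `latency` cycles,
// so the chains together start one every latency / chains cycles, but never
// faster than the unit's issue interval (1 cycle for one add per cycle).
constexpr double cycles_per_add(double latency, double issue_interval, int chains) {
    return (latency / chains > issue_interval) ? latency / chains : issue_interval;
}

// Predicted speedup of `chains` interleaved chains over a single chain.
constexpr double predicted_speedup(double latency, double issue_interval, int chains) {
    return cycles_per_add(latency, issue_interval, 1) /
           cycles_per_add(latency, issue_interval, chains);
}
```

With a latency of 3 and an issue interval of 1, `predicted_speedup(3.0, 1.0, 4)` evaluates to 3, which matches the measured ratio above.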
The sequential version has very little ILP because all the floating-point adds are on the critical path: each add needs to wait until the previous add is done. The unrolled version has 4 separate dependency chains with up to 4 independent adds, all of which can be run in parallel. Only 3 are required to saturate the processor core, since a 3-cycle latency at one add per cycle means at most 3 adds can be in flight at once.
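One way to check the "only 3 are required" claim yourself is to make the number of dependency chains a parameter and time each setting. The following is a rough sketch, not the code that produced the numbers above: `sum_with_chains` and `time_chains` are hypothetical helpers, and whether the compiler unrolls the inner loop and keeps the accumulators in registers depends on the compiler and flags.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Adds `a` a total of `adds` times, spread over N independent accumulators.
// N is a compile-time constant so the inner loop can be fully unrolled.
template <std::size_t N>
double sum_with_chains(double a, long long adds) {
    static_assert(N > 0, "need at least one chain");
    assert(adds % static_cast<long long>(N) == 0);  // keep the total add count exact
    double acc[N] = {};
    for (long long c = 0; c < adds; c += static_cast<long long>(N)) {
        // N adds per iteration, each on its own dependency chain.
        for (std::size_t i = 0; i < N; ++i) acc[i] += a;
    }
    double total = 0;
    for (std::size_t i = 0; i < N; ++i) total += acc[i];
    return total;
}

// Returns the wall-clock seconds taken by one call to sum_with_chains<N>
// and stores the computed sum in `sum_out` so the work cannot be discarded.
template <std::size_t N>
double time_chains(double a, long long adds, double& sum_out) {
    const auto start = std::chrono::steady_clock::now();
    sum_out = sum_with_chains<N>(a, adds);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

For example, calling `time_chains<1>`, `time_chains<2>`, `time_chains<3>` and `time_chains<4>` with `a = 2.3` and `adds = 1200000000` (divisible by every chain count used) and printing the returned times should show where the curve flattens. On a core with a 3-cycle add latency and one add per cycle, the expectation is that going from 3 to 4 chains brings little or no further gain.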