I am investigating the effect of vectorization on the performance of the program. In this regard, I have written following code:
#include
#in
Just in case a[] b[] and c[] are fighting for the L2 cache ::
#include /* for memcpy */
...
gettimeofday(&stTime, NULL);
for(k = 0; k < LEN; k += 4) {
double a4[4], b4[4], c4[4];
memcpy(a4,a+k, sizeof a4);
memcpy(b4,b+k, sizeof b4);
c4[0] = a4[0] * b4[0];
c4[1] = a4[1] * b4[1];
c4[2] = a4[2] * b4[2];
c4[3] = a4[3] * b4[3];
memcpy(c+k,c4, sizeof c4);
}
gettimeofday(&endTime, NULL);
Reduces the running time from 98429.000000 to 67213.000000; unrolling the loop 8-fold reduces it to 57157.000000 here.