I have some big arrays passed from MATLAB to C++ (so I have to take them as they are) that need casting and permuting (row-major vs. column-major issues).
The arr
The problem in this example is cache locality. Looking at the assignment, the fastest-changing index is j but this has the largest effect on the address of the write in the target array:
img[i + k*size_proj[1] + j*size_proj[0] * size_proj[1]] =
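Notice that j is multiplied by two constants, so each increment of j moves the write address by size_proj[0] * size_proj[1] elements. Every change in j is therefore likely to cause the result to be written to a new cache line.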
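To make the stride concrete, here is a minimal sketch that computes how far apart two consecutive writes land when only j changes. The helper names `write_offset` and `j_stride` are hypothetical and only mirror the index expression above, with K standing for size_proj[0] and I for size_proj[1].

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: linear offset of the write into the target array,
// mirroring img[i + k*size_proj[1] + j*size_proj[0]*size_proj[1]]
// with K = size_proj[0] and I = size_proj[1].
inline std::size_t write_offset(std::size_t i, std::size_t k, std::size_t j,
                                std::size_t K, std::size_t I) {
    assert(i < I && k < K);  // indices must lie inside one K x I slab
    return i + k * I + j * K * I;
}

// Distance, in elements, between the writes for j and j + 1
// when i and k are held fixed.
inline std::size_t j_stride(std::size_t K, std::size_t I) {
    return write_offset(0, 0, 1, K, I) - write_offset(0, 0, 0, K, I);
}
```

For example, with assumed dimensions K = I = 512, `j_stride(512, 512)` is 262144 elements, which for float is 1 MiB between consecutive writes. That is far larger than a cache line (commonly 64 bytes), so an inner loop over j touches a different line on every iteration.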
The solution in this case is to invert the order of the loops, so that j, which has the largest stride, changes least often:
const auto K = size_proj[0];
const auto I = size_proj[1];
const auto J = size_proj[2];
for (int j = 0; j < J; j++)
    for (int i = 0; i < I; i++)
        for (int k = 0; k < K; k++)
            img[i + k * I + j * K * I] = (float)imgaux[k + i * K + j * K * I];
Which (on my laptop) brings us down from:
Time permuting and casting the input 4.416232
to:
Time permuting and casting the input 0.844341
Which I think you'll agree is something of an improvement.
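If you want to check that the reordered loops still produce the same permutation, a self-contained version along these lines may help. This is only a sketch: the function name `permute_and_cast`, the template parameter, and the use of std::vector are my assumptions for illustration, since the real buffers come from MATLAB.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper around the reordered loops. K, I and J play the
// roles of size_proj[0], size_proj[1] and size_proj[2].
template <typename Src>
std::vector<float> permute_and_cast(const std::vector<Src>& imgaux,
                                    std::size_t K, std::size_t I,
                                    std::size_t J) {
    assert(imgaux.size() == K * I * J);
    std::vector<float> img(K * I * J);
    // j is outermost because it has the largest stride in both arrays;
    // k is innermost so the reads from imgaux are contiguous.
    for (std::size_t j = 0; j < J; ++j)
        for (std::size_t i = 0; i < I; ++i)
            for (std::size_t k = 0; k < K; ++k)
                img[i + k * I + j * K * I] =
                    static_cast<float>(imgaux[k + i * K + j * K * I]);
    return img;
}
```

Running it on a tiny array (say K = 2, I = 3, J = 2) filled with 0, 1, 2, ... makes it easy to check a few elements by hand before timing it on the real data.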