Accessing three static arrays is quicker than one static array containing 3x data?

Submitted by 不打扰是莪最后的温柔 on 2019-12-23 09:15:53

Question


I have 700 items, and I loop over all 700; for each item I obtain its three attributes and perform some basic calculations. I have implemented this using two techniques:

1) Three 700-element arrays, one array for each of the three attributes. So:

 item0.a = array1[0]
 item0.b = array2[0]
 item0.e = array3[0]

2) One 2100-element array containing data for the three attributes consecutively. So:

 item0.a = array[(0*3)+0]
 item0.b = array[(0*3)+1]
 item0.e = array[(0*3)+2]

Now, the three item attributes a, b and e are used together within the loop; therefore it would make sense that storing them in one array should perform better than the three-array technique (due to spatial locality). However:

  • Three 700-element arrays = 3300 CPU cycles on average for the whole loop
  • One 2100-element array = 3500 CPU cycles on average for the whole loop

Here is the code for the 2100-array technique:

unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array[2100];

//I have left out code for simplicity. You can assume by now the array is populated.

start = __rdtscp(&x);

for(int i=0; i < 700; i++){
    unsigned short j = i * 3;
    unsigned int a = array[j + 0];     
    unsigned int b = array[j + 1];     

    data_for_all_items = data_for_all_items & (a != -1 & b != -1);

    unsigned int e = array[j + 2];    

    c += (a * e);
    d += (b * e);
}

finish = __rdtscp(&y);

and here is the code for the three 700-element arrays technique:

unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array1[700];
unsigned int array2[700];
unsigned int array3[700];

//I have left out code for simplicity. You can assume by now the arrays are populated.

start = __rdtscp(&x);

for(int i=0; i < 700; i++){
    unsigned int a= array1[i];     //Array 1
    unsigned int b= array2[i];     //Array 2

    data_for_all_items = data_for_all_items & (a != -1 & b != -1);

    unsigned int e = array3[i];    //Array 3

    c += (a * e);
    d += (b * e);
}

finish = __rdtscp(&y);

Why isn't the technique using one 2100-element array faster? It should be, because the three attributes are used together for each of the 700 items.

I used MSVC 2012 on Windows 7, 64-bit.

Assembly for 3x 700-element array technique:

            start = __rdtscp(&x);
 rdtscp  
 shl         rdx,20h  
 lea         r8,[this]  
 or          rax,rdx  
 mov         dword ptr [r8],ecx  
 mov         r8d,8ch  
 mov         r9,rax  
 lea         rdx,[rbx+0Ch]  

            for(int i=0; i < 700; i++){
 sub         rdi,rbx  
                unsigned int a = array1[i];
                unsigned int b = array2[i];

                data_for_all_items = data_for_all_items & (a != -1 & b != -1);
 cmp         dword ptr [rdi+rdx-0Ch],0FFFFFFFFh  
 lea         rdx,[rdx+14h]  
 setne       cl  
 cmp         dword ptr [rdi+rdx-1Ch],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdi+rdx-18h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdi+rdx-10h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdi+rdx-14h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdx-20h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdx-1Ch],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdx-18h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdx-10h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdx-14h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 and         r15b,cl  
 dec         r8  
 jne         013F26DA53h  

                unsigned int e = array3[i];

                c += (a * e);
                d += (b * e);
            }


            finish = __rdtscp(&y);
 rdtscp  
 shl         rdx,20h  
 lea         r8,[y]  
 or          rax,rdx  
 mov         dword ptr [r8],ecx 

Assembly for the 2100-element array technique:

            start = __rdtscp(&x);
 rdtscp  
 lea         r8,[this]  
 shl         rdx,20h  
 or          rax,rdx  
 mov         dword ptr [r8],ecx  

            for(int i=0; i < 700; i++){
 xor         r8d,r8d  
 mov         r10,rax  
                unsigned short j = i*3;
 movzx       ecx,r8w  
 add         cx,cx  
 lea         edx,[rcx+r8]  
                unsigned int a = array[j + 0];
                unsigned int b = array[j + 1];

                data_for_all_items = data_for_all_items & (a != -1 & b != -1);
 movzx       ecx,dx  
 cmp         dword ptr [r9+rcx*4+4],0FFFFFFFFh  
 setne       dl  
 cmp         dword ptr [r9+rcx*4],0FFFFFFFFh  
 setne       al  
 inc         r8d  
 and         dl,al  
 and         r14b,dl  
 cmp         r8d,2BCh  
 jl          013F05DA10h  

                unsigned int e = array[j + 2];

                c += (a * e);
                d += (b * e);
            }


            finish = __rdtscp(&y);
 rdtscp  
 shl         rdx,20h  
 lea         r8,[y]  
 or          rax,rdx  
 mov         dword ptr [r8],ecx  

Answer 1:


Edit: Given your assembly code, the second loop (the one using three 700-element arrays) is unrolled five times: the counter in r8 starts at 8Ch (140 = 700/5). The unrolled version can run faster on an out-of-order execution CPU such as any modern x86/x86-64 CPU.
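Not the compiler's literal output, but a sketch of what that five-fold unrolling means: the loop body is replicated five times, so the counter only runs 700/5 = 140 times, matching the 8Ch (140) loaded into r8 in the disassembly above. Two of the five copies are shown:

for(int i = 0; i < 700; i += 5){
    unsigned int a0 = array1[i],     b0 = array2[i],     e0 = array3[i];
    unsigned int a1 = array1[i + 1], b1 = array2[i + 1], e1 = array3[i + 1];
    //...the same pattern repeats for i+2, i+3 and i+4...
    data_for_all_items = data_for_all_items & (a0 != -1 & b0 != -1)
                                            & (a1 != -1 & b1 != -1);
    c += (a0 * e0); c += (a1 * e1);
    d += (b0 * e0); d += (b1 * e1);
}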


The second code (the three separate arrays) is vectorisable: two elements of each array could be loaded at each iteration into one XMM register each. Since modern CPUs use SSE for both scalar and vector FP arithmetic, this cuts the number of cycles roughly in half. With an AVX-capable CPU, four doubles could be loaded into a YMM register, and the number of cycles should therefore be cut by a factor of four.
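As one concrete illustration (my own sketch, not the answer's code): the three-array loop can also be vectorised with integer SIMD, processing four items per iteration. This assumes SSE4.1 (for _mm_mullo_epi32) and that the running sums fit in 32-bit lanes, which sidesteps the int-to-double conversion:

#include <smmintrin.h>  // SSE4.1

__m128i all = _mm_set1_epi32(-1);       // running "all items valid" mask
__m128i cv  = _mm_setzero_si128();      // four partial sums for c
__m128i dv  = _mm_setzero_si128();      // four partial sums for d
const __m128i minus1 = _mm_set1_epi32(-1);

for(int i = 0; i < 700; i += 4){        // 700 is an exact multiple of 4
    __m128i a = _mm_loadu_si128((const __m128i*)&array1[i]);
    __m128i b = _mm_loadu_si128((const __m128i*)&array2[i]);
    __m128i e = _mm_loadu_si128((const __m128i*)&array3[i]);
    // per lane: all-ones where a != -1 && b != -1
    __m128i bad = _mm_or_si128(_mm_cmpeq_epi32(a, minus1),
                               _mm_cmpeq_epi32(b, minus1));
    all = _mm_and_si128(all, _mm_andnot_si128(bad, minus1));  // all &= ~bad
    cv  = _mm_add_epi32(cv, _mm_mullo_epi32(a, e));
    dv  = _mm_add_epi32(dv, _mm_mullo_epi32(b, e));
}
// Horizontal reductions of cv, dv and all then yield c, d and
// data_for_all_items.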

The first loop (over the interleaved 2100-element array) is not vectorisable along i, since the value of a in iteration i+1 comes from a location 3 elements after the one the value of a in iteration i comes from. In that case vectorisation requires gathered vector loads, and those are only supported in the AVX2 instruction set.
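For reference, a hypothetical sketch of what such a gathered load looks like with AVX2 intrinsics: indices 0, 3, 6, ..., 21 pick the same attribute out of eight consecutive interleaved items:

#include <immintrin.h>  // AVX2

const __m256i stride3 = _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21);
for(int i = 0; i + 8 <= 700; i += 8){   // the 700 % 8 remainder needs a scalar tail
    __m256i av = _mm256_i32gather_epi32((const int*)&array[i*3],     stride3, 4);
    __m256i bv = _mm256_i32gather_epi32((const int*)&array[i*3 + 1], stride3, 4);
    __m256i ev = _mm256_i32gather_epi32((const int*)&array[i*3 + 2], stride3, 4);
    // ...same compares and multiply-accumulates as before; note that gathers
    // are far slower than the contiguous loads the three-array layout allows.
}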

Using proper data structures is crucial when programming CPUs with vector capabilities. Converting code like your first loop into something like your second loop is 90% of the job one has to do in order to get good performance on the Intel Xeon Phi, which has very wide vector registers but an awfully slow in-order execution engine.
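That conversion is mechanical; a minimal deinterleave pass (my own sketch) that turns the first loop's layout into the second loop's is:

// Split the interleaved {a,b,e,a,b,e,...} array into three planar arrays.
for(int i = 0; i < 700; i++){
    array1[i] = array[i*3 + 0];
    array2[i] = array[i*3 + 1];
    array3[i] = array[i*3 + 2];
}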




Answer 2:


The simple answer is that version 1 is SIMD friendly and version 2 is not. However, it's possible to make version 2, the 2100-element array, SIMD friendly. You need to use a hybrid Struct of Arrays, aka an Array of Struct of Arrays (AoSoA). You arrange the array like this: aaaa bbbb eeee aaaa bbbb eeee ....

Below is code using GCC's vector extensions to do this. Note that the 2100-element array code now looks almost the same as the 700-element array code, but it uses one array instead of three. And instead of having 700 elements between a, b, and e, there are only 12 elements between them.

I did not find an easy solution for converting uint4 to double4 with the GCC vector extensions, and I don't want to spend the time writing intrinsics for it right now, so I kept the accumulators as unsigned int vectors; for performance I would not want to be converting uint4 to double4 inside the loop anyway.

typedef unsigned int uint4 __attribute__ ((vector_size (16)));
//typedef double double4 __attribute__ ((vector_size (32)));
uint4 zero = {};
unsigned int array[2100];
uint4 test = -1 + zero;            // broadcasts -1 into all four lanes
//double4 cv = {};
//double4 dv = {};
uint4 cv = {};
uint4 dv = {};

uint4* av = (uint4*)&array[0];     // a blocks begin at offset 0
uint4* bv = (uint4*)&array[4];     // b blocks begin at offset 4
uint4* ev = (uint4*)&array[8];     // e blocks begin at offset 8

for(int i=0; i < 525; i+=3) { //525 = 2100/4 = 700/4*3; step 3 blocks = 12 ints
    test = test & ((av[i] != -1) & (bv[i] != -1));
    cv += (av[i] * ev[i]);
    dv += (bv[i] * ev[i]);
}
double c = cv[0] + cv[1] + cv[2] + cv[3];
double d = dv[0] + dv[1] + dv[2] + dv[3];
bool data_for_all_items = test[0] & test[1] & test[2] & test[3];
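To build that aaaa bbbb eeee layout from the interleaved {a,b,e} array in the first place, a repacking pass is needed; the helper below is my own sketch, not part of the answer:

// Repack {a,b,e,a,b,e,...} (2100 ints) into AoSoA blocks of
// {aaaa,bbbb,eeee} so the vector loop above can consume it.
void interleaved_to_aosoa(const unsigned int* src, unsigned int* dst){
    for(int block = 0; block < 175; block++){        // 700 items / 4 per block
        for(int k = 0; k < 4; k++){
            const unsigned int* item = &src[(block*4 + k) * 3];
            dst[block*12 + 0 + k] = item[0];  // a
            dst[block*12 + 4 + k] = item[1];  // b
            dst[block*12 + 8 + k] = item[2];  // e
        }
    }
}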



Answer 3:


The concept of 'spatial locality' is throwing you off a little bit. Chances are that with both solutions, your processor is doing its best to cache the arrays.

Unfortunately, the version of your code that uses one array also performs some extra index math per iteration. This is probably where your extra cycles are being spent.
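If the single-array layout is kept, most of that index math can be removed by stepping the index directly instead of recomputing j = i * 3 each iteration (a standard strength-reduction rewrite, sketched here):

for(int j = 0; j < 2100; j += 3){
    unsigned int a = array[j];
    unsigned int b = array[j + 1];

    data_for_all_items = data_for_all_items & (a != -1 & b != -1);

    unsigned int e = array[j + 2];

    c += (a * e);
    d += (b * e);
}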




Answer 4:


Spatial locality is indeed useful, but it's actually helping you much more in the second case (the three distinct arrays).

The cache line size is 64 bytes (note that it doesn't divide evenly by 3), so a single access to a 4- or 8-byte value effectively prefetches the next elements. In addition, keep in mind that the CPU's hardware prefetcher is likely to go on and prefetch even further elements ahead.

However, when a, b and e are packed together, you're "wasting" this valuable prefetching on elements of the same iteration. When you access a, there's no point in prefetching b and e: the next loads are already going there (and would likely just merge in the CPU with the first load, or wait for it to retrieve the data). In fact, when the arrays are merged, you fetch a new memory line only once per 64/(3*4) ≈ 5.3 iterations. The bad alignment even means that on some iterations you'll have a and maybe b long before you get e; this imbalance is usually bad news.

In reality, since the iterations are independent, your CPU would go ahead and start the second iteration relatively fast, thanks to the combination of loop unrolling (in case it was done) and out-of-order execution (calculating the index for the next set of iterations is simple and has no dependency on the loads issued by the previous ones). However, you would have to run ahead pretty far to issue the next load each time, and eventually the finite size of the CPU's instruction queues will block you, maybe before you reach the full potential memory bandwidth (number of parallel outstanding loads).

The alternative option, on the other hand, where you have three distinct arrays, uses the spatial locality / hardware prefetching solely across iterations. On each iteration you issue three loads, which fetch a full line once every 64/4 = 16 iterations. The overall data fetched is the same (well, it's the same data), but the timeliness is much better because you fetch ahead for the next 16 iterations instead of 5.3. The difference becomes even bigger when hardware prefetching is involved, because you have three streams instead of one, meaning you can issue more prefetches (and look even further ahead).
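If you wanted to push this further by hand, explicit software prefetch hints for each of the three streams are possible, though on a simple sequential pattern like this the hardware prefetcher usually does fine on its own. A sketch (the distance of 16 ints, one 64-byte line ahead, is illustrative only; a careful version would clamp the index near the ends of the arrays):

#include <xmmintrin.h>  // _mm_prefetch

for(int i = 0; i < 700; i++){
    _mm_prefetch((const char*)&array1[i + 16], _MM_HINT_T0);
    _mm_prefetch((const char*)&array2[i + 16], _MM_HINT_T0);
    _mm_prefetch((const char*)&array3[i + 16], _MM_HINT_T0);
    // ...original loop body from the question...
}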



Source: https://stackoverflow.com/questions/23162398/accessing-three-static-arrays-is-quicker-than-one-static-array-containing-3x-dat
