L1 memory bandwidth: 50% drop in efficiency using addresses which differ by 4096+64 bytes
I want to achieve the maximum bandwidth of the following operation with Intel processors.

    for(int i=0; i<n; i++) z[i] = x[i] + y[i]; //n=2048

where x, y, and z are float arrays. I am doing this on Haswell, Ivy Bridge, and Westmere systems.

I originally allocated the memory like this

    char *a = (char*)_mm_malloc(sizeof(float)*n, 64);
    char *b = (char*)_mm_malloc(sizeof(float)*n, 64);
    char *c = (char*)_mm_malloc(sizeof(float)*n, 64);
    float *x = (float*)a; float *y = (float*)b; float *z = (float*)c;

When I did this I got about 50% of the peak bandwidth I expected for each system. The peak values