Slower SSE performance on large array sizes
I am new to SSE programming so I am hoping someone out there can help me. I recently implemented a function using GCC SSE intrinsics to compute the sum of an array of 32-bit integers. The code for my implementation is given below. int ssum(const int *d, unsigned int len) { static const unsigned int BLOCKSIZE=4; unsigned int i,remainder; int output; __m128i xmm0, accumulator; __m128i* src; remainder = len%BLOCKSIZE; src = (__m128i*)d; accumulator = _mm_loadu_si128(src); output = 0; for(i=BLOCKSIZE;i<len-remainder;i+=BLOCKSIZE){ xmm0 = _mm_loadu_si128(++src); accumulator = _mm_add_epi32