SIMD code runs slower than scalar code

Submitted by 喜夏-厌秋 on 2019-11-30 15:17:42

Disclaimer: I come from a PowerPC background, so what I'm saying here might be complete hogwash. But you're stalling your vector pipeline since you try to access your results right away.

It is best to keep everything in your vector pipeline. As soon as you do any kind of conversion from vector to int or float, or storing the result into memory, you're stalling.

The best mode of operation when dealing with SSE or VMX is: Load, process, store. Load the data into your vector registers, do all the vector processing, then store it to memory.

I would recommend: Reserve several __m128i variables, unroll your loop several times, and only then store the results.

EDIT: Also, if you unroll, and if you align res1 and res2 to 16 bytes, you can store your results directly to memory without going through this simdstore indirection, which is probably causing a load-hit-store (LHS) penalty and another stall.
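A minimal sketch of that alignment (the array type and sizes here are hypothetical, not from the original question):

```c
#include <stdint.h>

/* Hypothetical buffers: _Alignas(16) (C11) guarantees the 16-byte
 * alignment that _mm_load_si128/_mm_store_si128 require.  For heap
 * buffers, _mm_malloc(size, 16) or posix_memalign work instead. */
_Alignas(16) static int32_t res1[1024];
_Alignas(16) static int32_t res2[1024];
```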

EDIT: Forgot the obvious. If your polylen is typically large, don't forget to do a data cache prefetch on every iteration.
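As one possible sketch of that prefetch, using the _mm_prefetch intrinsic (the helper name and prefetch distance below are made up; tune the distance to your loop cost and cache-line size):

```c
#include <stddef.h>
#include <xmmintrin.h>

#define PREFETCH_DIST 64  /* bytes ahead: one cache line, a guess */

/* Hypothetical helper: xor-accumulate src into dst, hinting the
 * cache to pull in data one cache line ahead of the current index. */
void xor_with_prefetch(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        _mm_prefetch((const char *)src + i + PREFETCH_DIST, _MM_HINT_T0);
        dst[i] ^= src[i];
    }
}
```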

The way your code looks, res1 and res2 seem to be totally independent vectors, yet you mix them in the same register to XOR them.

I would use different registers, something like the following (the vectors must all be aligned):

__m128i x0, x1, x2, x3;
for (i = 0; i < _polylen; i++)
{
    u1 = (elma[i] >> l) & 15;
    u2 = (elmc[i] >> l) & 15;
    for (k = 0; k < 20; k += 2)
    {
        // res1[i + k] ^= _mulpre1[u1][k];
        x0 = _mm_load_si128(&_mulpre1[u1][k]);
        x1 = _mm_load_si128(&res1[i + k]);
        x0 = _mm_xor_si128(x0, x1);
        _mm_store_si128(&res1[i + k], x0);

        // res2[i + k] ^= _mulpre2[u2][k];
        x2 = _mm_load_si128(&_mulpre2[u2][k]);
        x3 = _mm_load_si128(&res2[i + k]);
        x2 = _mm_xor_si128(x2, x3);
        _mm_store_si128(&res2[i + k], x2);
    }
}

Note that I am using only 4 registers. You can manually unroll to use all 8 XMM registers on x86, or all 16 on x86-64.
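A sketch of such an unroll on a single stream (the function name is mine, not from the question; it assumes 16-byte-aligned __m128i arrays and an even vector count):

```c
#include <stddef.h>
#include <emmintrin.h>

/* Hypothetical helper: xor-accumulate pre into res, unrolled twice
 * so two independent load/xor/store chains keep more registers busy. */
void xor_acc_unrolled(__m128i *res, const __m128i *pre, size_t n)
{
    for (size_t k = 0; k < n; k += 2) {
        __m128i a0 = _mm_load_si128(&pre[k]);
        __m128i a1 = _mm_load_si128(&pre[k + 1]);
        __m128i r0 = _mm_load_si128(&res[k]);
        __m128i r1 = _mm_load_si128(&res[k + 1]);
        _mm_store_si128(&res[k],     _mm_xor_si128(r0, a0));
        _mm_store_si128(&res[k + 1], _mm_xor_si128(r1, a1));
    }
}
```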

You're doing very little computation here relative to the number of loads and stores that are being performed, so as a result you are seeing little benefit from SIMD. You will probably find it more useful in this case to use scalar code, particularly if you have an x86-64 CPU that you can use in 64-bit mode. This will reduce the number of loads and stores, which are currently the dominant factor in your performance.
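For instance, the same xor-accumulate in plain 64-bit scalar code (names and element type assumed, not from the original); on x86-64 each iteration moves 8 bytes per load/store with none of the SIMD packing overhead:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar version: xor-accumulate pre into res,
 * n counted in 64-bit words. */
void xor_acc_scalar64(uint64_t *res, const uint64_t *pre, size_t n)
{
    for (size_t k = 0; k < n; k++)
        res[k] ^= pre[k];
}
```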

(Note: you probably should NOT unroll the loop, especially if you're using Core 2 or newer.)

I'm no SIMD expert either, but it looks like you could also benefit from prefetching the data, coupled with the unrolling EboMike mentioned. It might also help to merge res1 and res2 into one aligned array (of structs, depending on what else uses it); then you don't need the extra copying and can operate on it directly.
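A sketch of that merge (the struct layout and sizes are guesses; adapt them to whatever else reads res1 and res2):

```c
#include <stdint.h>
#include <emmintrin.h>

/* Hypothetical interleaved layout: each pair keeps the res1 and res2
 * slots for one index adjacent, so both streams are updated from
 * contiguous, naturally 16-byte-aligned memory. */
typedef struct {
    __m128i r1;  /* slot formerly in res1 */
    __m128i r2;  /* slot formerly in res2 */
} res_pair;

static res_pair res[32];  /* __m128i members force 16-byte alignment */
```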
