SIMD code runs slower than scalar code

Submitted by 喜夏-厌秋 on 2019-11-30 15:17:42

Disclaimer: I come from a PowerPC background, so what I'm saying here might be complete hogwash. But you're stalling your vector pipeline since you try to access your results right away.

It is best to keep everything in your vector pipeline. As soon as you do any kind of conversion from vector to int or float, or storing the result into memory, you're stalling.

The best mode of operation when dealing with SSE or VMX is: Load, process, store. Load the data into your vector registers, do all the vector processing, then store it to memory.

I would recommend: Reserve several __m128i variables, unroll your loop several times, and only then store the results.

EDIT: Also, if you unroll, and if you align res1 and res2 to 16 bytes, you can store your results directly to memory without going through this simdstore indirection, which is probably causing a load-hit-store (LHS) penalty and another stall.
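A minimal sketch of that alignment (the array type and sizes here are hypothetical, not from the original question):

```c
#include <stdint.h>

/* Hypothetical buffers: _Alignas(16) (C11) guarantees the 16-byte
 * alignment that _mm_load_si128/_mm_store_si128 require.  For heap
 * buffers, _mm_malloc(size, 16) or posix_memalign work instead. */
_Alignas(16) static int32_t res1[1024];
_Alignas(16) static int32_t res2[1024];
```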

EDIT: Forgot the obvious. If your polylen is typically large, don't forget to do a data cache prefetch on every iteration.
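As one possible sketch of that prefetch, using the _mm_prefetch intrinsic (the helper name and prefetch distance below are made up; tune the distance to your loop cost and cache-line size):

```c
#include <stddef.h>
#include <xmmintrin.h>

#define PREFETCH_DIST 64  /* bytes ahead: one cache line, a guess */

/* Hypothetical helper: xor-accumulate src into dst, hinting the
 * cache to pull in data one cache line ahead of the current index. */
void xor_with_prefetch(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        _mm_prefetch((const char *)src + i + PREFETCH_DIST, _MM_HINT_T0);
        dst[i] ^= src[i];
    }
}
```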

The way your code looks, res1 and res2 seem to be totally independent vectors, yet you mix them in the same register to XOR them.

I would use different registers, something like the following (the vectors must all be aligned):

__m128i x0, x1, x2, x3;
for (i = 0; i < _polylen; i++)
{
    u1 = (elma[i] >> l) & 15;
    u2 = (elmc[i] >> l) & 15;
    for (k = 0; k < 20; k += 2)
    {
        // res1[i + k] ^= _mulpre1[u1][k];
        x0 = _mm_load_si128(&_mulpre1[u1][k]);
        x1 = _mm_load_si128(&res1[i + k]);
        x0 = _mm_xor_si128(x0, x1);
        _mm_store_si128(&res1[i + k], x0);

        // res2[i + k] ^= _mulpre2[u2][k];
        x2 = _mm_load_si128(&_mulpre2[u2][k]);
        x3 = _mm_load_si128(&res2[i + k]);
        x2 = _mm_xor_si128(x2, x3);
        _mm_store_si128(&res2[i + k], x2);
    }
}

Note that I am using only 4 registers. You can manually unroll to use all 8 XMM registers on x86, or all 16 on x86-64.
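A sketch of such an unroll on a single stream (the function name is mine, not from the question; it assumes 16-byte-aligned __m128i arrays and an even vector count):

```c
#include <stddef.h>
#include <emmintrin.h>

/* Hypothetical helper: xor-accumulate pre into res, unrolled twice
 * so two independent load/xor/store chains keep more registers busy. */
void xor_acc_unrolled(__m128i *res, const __m128i *pre, size_t n)
{
    for (size_t k = 0; k < n; k += 2) {
        __m128i a0 = _mm_load_si128(&pre[k]);
        __m128i a1 = _mm_load_si128(&pre[k + 1]);
        __m128i r0 = _mm_load_si128(&res[k]);
        __m128i r1 = _mm_load_si128(&res[k + 1]);
        _mm_store_si128(&res[k],     _mm_xor_si128(r0, a0));
        _mm_store_si128(&res[k + 1], _mm_xor_si128(r1, a1));
    }
}
```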

You're doing very little computation here relative to the number of loads and stores that are being performed, so as a result you are seeing little benefit from SIMD. You will probably find it more useful in this case to use scalar code, particularly if you have an x86-64 CPU that you can use in 64-bit mode. This will reduce the number of loads and stores, which are currently the dominant factor in your performance.
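For instance, the same xor-accumulate in plain 64-bit scalar code (names and element type assumed, not from the original); on x86-64 each iteration moves 8 bytes per load/store with none of the SIMD packing overhead:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar version: xor-accumulate pre into res,
 * n counted in 64-bit words. */
void xor_acc_scalar64(uint64_t *res, const uint64_t *pre, size_t n)
{
    for (size_t k = 0; k < n; k++)
        res[k] ^= pre[k];
}
```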

(Note: you probably should NOT unroll the loop, especially if you're using Core 2 or newer.)

I'm no SIMD expert either, but it looks like you could also benefit from prefetching the data, coupled with the unrolling EboMike mentioned. It might also help to merge res1 and res2 into one aligned array (of structs, depending on what else uses it); then you don't need the extra copying and can operate on it directly.
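A sketch of that merge (the struct layout and sizes are guesses; adapt them to whatever else reads res1 and res2):

```c
#include <stdint.h>
#include <emmintrin.h>

/* Hypothetical interleaved layout: each pair keeps the res1 and res2
 * slots for one index adjacent, so both streams are updated from
 * contiguous, naturally 16-byte-aligned memory. */
typedef struct {
    __m128i r1;  /* slot formerly in res1 */
    __m128i r2;  /* slot formerly in res2 */
} res_pair;

static res_pair res[32];  /* __m128i members force 16-byte alignment */
```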
