Load constant floats into SSE registers

别那么骄傲 2021-02-20 04:18

I'm trying to figure out an efficient way to load compile-time constant floats into SSE(2/3) registers. I've tried simple code like this:

const __m128 x          


        
4 Answers
  •  暖寄归人
    2021-02-20 05:20

    Generating constants is much simpler (and quicker) if the four float constants are the same. For example, the bit pattern for 1.0f is 0x3f800000. One way to generate it using SSE2 is:

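            register __m128i onef;
            /* pcmpeqb reg,reg sets every bit; the output-only constraint avoids an uninitialized-variable warning */
            __asm__ ( "pcmpeqb %0, %0" : "=x" ( onef ) );
            onef = _mm_slli_epi32( onef, 25 );  /* 0xfe000000 in each lane */
            onef = _mm_srli_epi32( onef, 2 );   /* 0x3f800000 in each lane */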
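
    The shift pair works because an all-ones lane shifted left by 25 keeps only its top seven bits (0xfe000000), and shifting that right by 2 yields 0x3f800000. As a rough scalar sketch of the same per-lane arithmetic (the helper names `one_bits_from_all_ones` and `bits_to_float` are hypothetical, introduced only for illustration):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: mirrors, for a single 32-bit lane, the shift pair above. */
static uint32_t one_bits_from_all_ones(void)
{
    uint32_t v = 0xFFFFFFFFu;  /* what pcmpeqb leaves in every lane */
    v <<= 25;                  /* 0xFE000000 */
    v >>= 2;                   /* 0x3F800000 */
    return v;
}

/* Reinterpret a 32-bit pattern as a float; memcpy avoids aliasing issues. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

    Other constants whose bit pattern is a contiguous run of ones can be derived the same way by changing the two shift counts.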
    

    Another approach, using SSE4.1, is:

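            register uint32_t t = 0x3f800000;
            register __m128i onef;  /* integer vector type, as _mm_shuffle_epi32 expects */
            /* AT&T operand order (gcc default); the Intel-order form needs -masm=intel */
            __asm__ ( "pinsrd $0, %1, %0" : "=x" ( onef ) : "r" ( t ) );
            onef = _mm_shuffle_epi32( onef, 0 );  /* broadcast lane 0 to all four lanes */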
    

    Note that I'm not positive this version is any faster than the SSE2 one; I have not profiled it, only checked that the result is correct.

    If the values of each of the four floats must be different, then each of the constants can be generated and shuffled or blended together.
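
    As an illustration of the generate-then-shuffle idea, here is a minimal sketch using SSE2 intrinsics only. The helper name `make_1_1_2_2` and the choice of constants are assumptions made for the example, not something from the question:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: builds {1.0f, 1.0f, 2.0f, 2.0f} without a memory load. */
static __m128 make_1_1_2_2(void)
{
    __m128i zero = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi32(zero, zero);  /* all bits set */

    /* 1.0f has the bit pattern 0x3F800000 */
    __m128i one_bits = _mm_srli_epi32(_mm_slli_epi32(ones, 25), 2);
    /* 2.0f has the bit pattern 0x40000000 */
    __m128i two_bits = _mm_srli_epi32(_mm_slli_epi32(ones, 31), 1);

    __m128 one = _mm_castsi128_ps(one_bits);
    __m128 two = _mm_castsi128_ps(two_bits);

    /* Lanes 0-1 come from `one`, lanes 2-3 from `two`. */
    return _mm_shuffle_ps(one, two, _MM_SHUFFLE(0, 0, 0, 0));
}
```

    Each extra distinct constant costs a few more instructions, so the break-even point against a plain memory load moves quickly as the number of distinct lanes grows.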

    Whether or not this is useful depends on whether a cache miss is likely; if it is not, loading the constant from memory is quicker. Tricks like this are very helpful on VMX/AltiVec, but the large caches on most PCs may make them less useful for SSE.
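
    For comparison, the memory-load alternative mentioned above might look like the following sketch. The table name `k_one` and the helper `load_one` are hypothetical, and the alignment attribute uses gcc/clang syntax:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical constant table; 16-byte alignment is required by _mm_load_ps.
   __attribute__((aligned)) is gcc/clang syntax. */
static const float k_one[4] __attribute__((aligned(16))) = {
    1.0f, 1.0f, 1.0f, 1.0f
};

/* A single aligned load; cheap when the table is already in cache. */
static __m128 load_one(void)
{
    return _mm_load_ps(k_one);
}
```

    Which variant wins in practice is something to measure in the actual code rather than assume.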

    There is a good discussion of this in Agner Fog's Optimization Manual, book 2, section 13.4, http://www.agner.org/optimize/.

    A final note: the inline assembler above is gcc-specific. It is there so that uninitialized variables can be used without generating a compiler warning. With VC, you may or may not need to first initialize the variables with _mm_setzero_ps(), then hope that the optimizer can remove this.
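
    A compiler-neutral variant of the SSE2 sequence, following the initialize-first idea, could be sketched with intrinsics alone. The function name `splat_one_sse2` is hypothetical, and whether the initial zeroing is optimized away depends on the compiler:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns {1.0f, 1.0f, 1.0f, 1.0f} without inline asm. */
static __m128 splat_one_sse2(void)
{
    __m128i v = _mm_setzero_si128();  /* defined starting value, no warning */
    v = _mm_cmpeq_epi32(v, v);        /* every lane compares equal: all bits set */
    v = _mm_slli_epi32(v, 25);        /* 0xFE000000 in each lane */
    v = _mm_srli_epi32(v, 2);         /* 0x3F800000 in each lane */
    return _mm_castsi128_ps(v);       /* reinterpret as four floats, no conversion */
}
```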
