SSE intrinsics: Convert 32-bit floats to UNSIGNED 8-bit integers

走了就别回头了 2020-12-16 06:36

Using SSE intrinsics, I've gotten a vector of four 32-bit floats clamped to the range 0-255 and rounded to nearest integer. I'd now like to write those four out as bytes.

2 answers
  • 2020-12-16 06:49

    We can solve the unsigned clamping issue by doing the first stage of packing with signed saturation. [0-255] fits in a signed 16-bit int, so values in that range will remain unclamped. Values outside that range will stay on the same side of it. Thus, the signed16 -> unsigned8 step will clamp them correctly.

    ;; SSE2: good for arrays of inputs
    cvtps2dq xmm0, [rsi]      ; 4 floats
    cvtps2dq xmm1, [rsi+16]   ; 4 more floats
    packssdw xmm0, xmm1       ; 8 int16_t
    
    cvtps2dq xmm1, [rsi+32]
    cvtps2dq xmm2, [rsi+48]
    packssdw xmm1, xmm2       ; 8 more int16_t
                              ; signed because that's how packuswb treats its input
    packuswb xmm0, xmm1       ; 16 uint8_t
    movdqa   [rdi], xmm0
    

    This only requires SSE2, not SSE4.1 for packusdw.
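
    For readers working with intrinsics rather than assembly, the sequence above might be sketched in C as follows. This is not from the original answer: the function name `floats_to_u8_x16` is hypothetical, and unaligned loads and stores are assumed for simplicity (the assembly uses an aligned `movdqa` store).

    ```c
    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>

    /* Hypothetical helper: convert 16 floats to 16 uint8_t, saturating to [0, 255].
       Rounding follows the current MXCSR mode (round-to-nearest-even by default). */
    static void floats_to_u8_x16(const float *src, uint8_t *dst)
    {
        __m128i a = _mm_cvtps_epi32(_mm_loadu_ps(src));       /* cvtps2dq */
        __m128i b = _mm_cvtps_epi32(_mm_loadu_ps(src + 4));
        __m128i c = _mm_cvtps_epi32(_mm_loadu_ps(src + 8));
        __m128i d = _mm_cvtps_epi32(_mm_loadu_ps(src + 12));

        __m128i ab = _mm_packs_epi32(a, b);        /* packssdw: 8 int16_t */
        __m128i cd = _mm_packs_epi32(c, d);        /* packssdw: 8 more int16_t */
        __m128i bytes = _mm_packus_epi16(ab, cd);  /* packuswb: 16 uint8_t */

        _mm_storeu_si128((__m128i *)dst, bytes);
    }
    ```

    Element order is preserved: `dst[i]` corresponds to `src[i]`, and out-of-range inputs such as 300.0f or -5.0f come out as 255 and 0.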

    I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte. packusdw is only useful if your final goal is uint16_t rather than further packing, since otherwise you'd need to mask off the sign bit before feeding the result to a further pack.

    If you did use packusdw -> packuswb, you'd get bogus results when the first step saturated to a uint16_t > 0x7fff. packuswb would interpret that as a negative int16_t and saturate it to 0. packssdw would saturate such inputs to 0x7fff, the max int16_t.

    (If your 32-bit inputs are always <= 0x7fff, you can use either, but SSE4.1 packusdw takes more instruction bytes than SSE2 packssdw, and never runs faster.)
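
    To make the saturation argument concrete, here is a scalar model of what each pack instruction does to a single element. It is only an illustrative sketch: the function names are hypothetical, and the real instructions operate on whole vectors.

    ```c
    #include <stdint.h>
    #include <string.h>

    /* Scalar models of the per-element behaviour of each pack instruction. */
    static int16_t sat_s32_to_s16(int32_t x)   /* packssdw */
    {
        return (int16_t)(x < -32768 ? -32768 : x > 32767 ? 32767 : x);
    }

    static uint16_t sat_s32_to_u16(int32_t x)  /* packusdw (SSE4.1) */
    {
        return (uint16_t)(x < 0 ? 0 : x > 65535 ? 65535 : x);
    }

    static uint8_t sat_s16_to_u8(int16_t x)    /* packuswb: input is read as signed */
    {
        return (uint8_t)(x < 0 ? 0 : x > 255 ? 255 : x);
    }

    /* packssdw -> packuswb */
    static uint8_t pack_signed_then_unsigned(int32_t x)
    {
        return sat_s16_to_u8(sat_s32_to_s16(x));
    }

    /* packusdw -> packuswb: the uint16_t bit pattern is reinterpreted as int16_t. */
    static uint8_t pack_unsigned_then_unsigned(int32_t x)
    {
        uint16_t w = sat_s32_to_u16(x);
        int16_t reinterpreted;
        memcpy(&reinterpreted, &w, sizeof w);
        return sat_s16_to_u8(reinterpreted);
    }
    ```

    With an input of 40000, the signed-first path saturates to 0x7fff and then to 255, while the unsigned-first path keeps 0x9c40, which packuswb reads as a negative word and saturates to 0.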


    If your source values can't be negative, and you only have one vector of 4 floats rather than many, you can use harold's pshufb idea. If they can be negative, you need to clamp negative values to zero rather than just truncating them by shuffling the low bytes into place.

    Using

    ;; SSE4.1, good for a single vector.  Use the PACK version above for arrays
    cvtps2dq   xmm0, xmm0
    pmaxsd     xmm0, zeroed-register
    pshufb     xmm0, [mask]
    movd       [somewhere], xmm0
    

    may be slightly more efficient than using two pack instructions, because pmax can run on port 1 or 5 (Intel Haswell). cvtps2dq is port 1 only, pshufb and pack* are port 5 only.
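
    A possible C rendering of the single-vector sequence is sketched below. The function name `floats_to_u8_x4_shuffle` is hypothetical, and the `target` attribute is a GCC/Clang-specific assumption that lets the function compile without `-msse4.1`; the CPU must still support SSE4.1 at run time. Like the assembly, it assumes inputs do not exceed 255, because the shuffle keeps only the low byte of each dword.

    ```c
    #include <smmintrin.h>  /* SSE4.1 (also pulls in SSSE3 and SSE2) */
    #include <stdint.h>

    #if defined(__GNUC__) || defined(__clang__)
    __attribute__((target("sse4.1")))  /* assumption: GCC/Clang per-function target */
    #endif
    static uint32_t floats_to_u8_x4_shuffle(__m128 v)
    {
        /* Byte i of the result comes from byte mask[i] of the source;
           an index with the high bit set (0x80, written as -128) yields zero. */
        const __m128i mask = _mm_setr_epi8(0, 4, 8, 12,
                                           -128, -128, -128, -128,
                                           -128, -128, -128, -128,
                                           -128, -128, -128, -128);
        __m128i i32 = _mm_cvtps_epi32(v);               /* cvtps2dq */
        i32 = _mm_max_epi32(i32, _mm_setzero_si128());  /* pmaxsd: clamp negatives to 0 */
        i32 = _mm_shuffle_epi8(i32, mask);              /* pshufb: gather the low bytes */
        return (uint32_t)_mm_cvtsi128_si32(i32);        /* movd: 4 bytes, little-endian */
    }
    ```

    The four result bytes are returned packed in a `uint32_t`, with the first float in the lowest byte.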

  • 2020-12-16 07:02

    There is no direct conversion from float to byte; _mm_cvtps_pi8 is a composite. _mm_cvtps_pi16 is also a composite, and in this case it just does some pointless work that you undo with the shuffle. Both also return annoying __m64 values.

    Anyway, we can convert to dwords (signed, but that doesn't matter), and then pack (unsigned) or shuffle them into bytes. _mm_shuffle_(e)pi8 generates a pshufb; Core2 45nm and AMD processors aren't too fond of it, and you have to get a mask from somewhere.

    Either way, you don't have to round to the nearest integer first; the convert will do that, at least if you haven't messed with the rounding mode.
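
    As a small illustration of the rounding behaviour, the sketch below contrasts `_mm_cvtps_epi32` (cvtps2dq, which uses the current MXCSR rounding mode) with the truncating `_mm_cvttps_epi32` (cvttps2dq). The helper names are hypothetical, and the default round-to-nearest-even mode is assumed.

    ```c
    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>

    /* Hypothetical helper: convert 4 floats using the current MXCSR rounding mode. */
    static void convert_nearest(const float in[4], int32_t out[4])
    {
        __m128i r = _mm_cvtps_epi32(_mm_loadu_ps(in));   /* cvtps2dq */
        _mm_storeu_si128((__m128i *)out, r);
    }

    /* Hypothetical helper: convert 4 floats with truncation toward zero, for contrast. */
    static void convert_truncate(const float in[4], int32_t out[4])
    {
        __m128i r = _mm_cvttps_epi32(_mm_loadu_ps(in));  /* cvttps2dq */
        _mm_storeu_si128((__m128i *)out, r);
    }
    ```

    Under the default mode, halfway cases go to the even neighbour, so 1.5 and 2.5 both convert to 2, while the truncating form gives 1 and 2.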

    Using packs (not tested): probably not useful, because packusdw already outputs unsigned words but packuswb then treats its input as signed words. Kept around because it is referred to elsewhere.

    cvtps2dq xmm0, xmm0  
    packusdw xmm0, xmm0     ; unsafe: saturates to a different range than packuswb accepts
    packuswb xmm0, xmm0
    movd somewhere, xmm0
    

    Using different packs:

    cvtps2dq xmm0, xmm0  
    packssdw xmm0, xmm0     ; correct: signed saturation on first step to feed packuswb
    packuswb xmm0, xmm0
    movd somewhere, xmm0
    

    Using shuffle: (not tested)

    cvtps2dq xmm0, xmm0
    pshufb xmm0, [shufmask]
    movd somewhere, xmm0
    
    shufmask: db 0, 4, 8, 12, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h
    