SSE/SIMD shift with one-byte element size / granularity?

As you know, SSE provides the following SIMD shift instructions: PSLL(W/D/Q) and PSRL(W/D/Q).

There's no PSLLB instruction, so how can we shift vectors of 8-bit values (single bytes)?

In the special-case of left-shift-by-one, you can use paddb xmm0, xmm0.
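For anyone working from C, here is a minimal sketch of that special case using SSE2 intrinsics. The function name `shl1_epi8` is made up for illustration; `_mm_add_epi8` is the intrinsic that normally compiles to paddb.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Left-shift every byte by 1.  x + x == x << 1 (mod 256), and PADDB
 * never carries across byte boundaries, so no mask is needed. */
static inline __m128i shl1_epi8(__m128i v)
{
    return _mm_add_epi8(v, v);   /* paddb xmm0, xmm0 */
}
```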

As Jester points out in comments, the best option to emulate the non-existent psrlb and psllb is to use a wider shift and then mask off any bits that crossed element boundaries.

e.g.

    psrlw   xmm0, 2       ; doesn't matter what size (w/d/q): performance is the same for all sizes on all CPUs
    pand    xmm0, [mask_right2]

    section .rodata
    align 16
    ;; required mask depends on the shift count
    mask_right2: times 16  db 0xff >> 2      ; 16 bytes of 0x3f
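The same shift-then-mask idea can be sketched in C with SSE2 intrinsics. The function names are hypothetical, and the shift count is fixed at 2 to match the example above. The left-shift variant is the mirror image: it clears the low bits that leaked in from the neighbouring byte.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Logical right shift of each byte by 2: psrlw, then pand with 0x3f. */
static inline __m128i srl2_epi8(__m128i v)
{
    const __m128i mask = _mm_set1_epi8(0xff >> 2);       /* 16 bytes of 0x3f */
    return _mm_and_si128(_mm_srli_epi16(v, 2), mask);
}

/* Left shift of each byte by 2: psllw, then pand with 0xfc. */
static inline __m128i sll2_epi8(__m128i v)
{
    const __m128i mask = _mm_set1_epi8(-4);              /* 16 bytes of 0xfc */
    return _mm_and_si128(_mm_slli_epi16(v, 2), mask);
}
```

A compiler will normally materialize the mask as a 16-byte constant load, much like the `mask_right2` constant above.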

Or broadcast 0x3f into a vector register ahead of a loop some other way: vpbroadcastd or vbroadcastss from a dword in memory, SSE3 movddup from a qword, or just a movdqa vector load. (vpbroadcastb takes an extra ALU uop, unlike dword or wider broadcasts, which are just simple loads.) Or generate the mask on the fly with a sequence like pcmpeqd xmm0,xmm0 / psrlw xmm0, 8+2 / packuswb xmm0,xmm0. With the right choice of shift count, you can generate any byte value of the form 2ⁿ-1 (zeros in the high bits, then ones in the low bits), repeated in every byte.

mov r32, imm32 / movd xmm, r32 and shuffle is also an option, but probably won't save instruction bytes compared to the pcmpeqd / psrlw / packuswb sequence. (Note that the register-source version of VBROADCASTSS is AVX2-only, which doesn't matter here since 256-bit integer shifts are also AVX2-only.)
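As a rough illustration of the on-the-fly sequence, here it is with SSE2 intrinsics. The function name `make_mask_right2` is hypothetical, and an optimizing compiler may constant-fold the whole thing into a plain load, so treat this as a way to check the arithmetic rather than a guarantee about the generated code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Build 16 bytes of 0x3f without a memory constant. */
static inline __m128i make_mask_right2(void)
{
    __m128i z    = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi32(z, z);          /* pcmpeqd: all bits set        */
    __m128i w    = _mm_srli_epi16(ones, 8 + 2);    /* psrlw 8+2: each word 0x003f  */
    return _mm_packus_epi16(w, w);                 /* packuswb: each byte 0x3f     */
}
```

The word value 0x003f is within 0..255, so the unsigned saturation in packuswb leaves it unchanged.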

For a variable-count vector shift, creating the mask in an integer register and broadcasting it to a vector is one option (use pshufb with an all-zero register to broadcast the low byte, or use imul eax, eax, 0x01010101 to go from a byte to a dword for movd + pshufd). You could also use the pcmpeqd method to create an all-ones vector, shift it with psrlw xmm0, xmm1, and then pack or pshufb.
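A sketch of the integer-register route for a runtime count, using SSE2 intrinsics only. The function name `srl_epi8_var` is an assumption for illustration, and the count is assumed to be in the range 0..7. The multiply by 0x01010101 plays the role of the imul, and `_mm_cvtsi32_si128` + `_mm_shuffle_epi32` correspond to movd + pshufd.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Logical right shift of each byte by a runtime count n (assumed 0..7). */
static inline __m128i srl_epi8_var(__m128i v, unsigned n)
{
    /* Byte mask 0xff >> n, replicated into all four bytes of a dword. */
    uint32_t m32  = (0xffu >> n) * 0x01010101u;
    /* movd + pshufd: broadcast the dword to the whole vector. */
    __m128i mask  = _mm_shuffle_epi32(_mm_cvtsi32_si128((int)m32), 0);
    /* psrlw xmm, xmm takes its count from the low 64 bits of the register. */
    __m128i count = _mm_cvtsi32_si128((int)n);
    return _mm_and_si128(_mm_srl_epi16(v, count), mask);
}
```

In a loop with a loop-invariant count, the mask and count vectors would be hoisted so that only the psrlw and pand remain per iteration.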

I don't see any similarly efficient way to emulate arithmetic right-shift (the non-existent PSRAB). The high byte of each word is handled correctly by PSRAW. Shifting the low byte of each word to the high position would let another PSRAW copy its sign bit as many times as required.

    ; input in xmm0.  Using AVX to save on mov instructions
    VPSLLDQ   xmm1, xmm0, 1      ; or VPSLLW xmm1, xmm0, 8, but VPSLLDQ moves one of the uops to the shuffle port
    VPSRAW    xmm1, xmm1, 8+2    ; shift low bytes back to final destination

    VPSRAW    xmm0, xmm0, 2      ; shift high bytes, leaving garbage in low bytes
    VPBLENDVB xmm0, xmm1, xmm0, xmm2  ; xmm2 holds a mask of alternating 0 and -1 bytes (-1 in the high byte of each word, so those bytes come from xmm0); it could be generated with pcmpeqw / psllw 8.  This instruction is fairly slow

There is no immediate-blend with byte granularity, because a single immediate byte can only encode 8 elements.

Without VPBLENDVB (possibly better even when it's available, if generating or loading a constant for it is slow):

    VPSLLDQ   xmm1, xmm0, 1      ; or VPSLLW 8
    VPSRAW    xmm1, xmm1, n      ; low bytes in the wrong place

    VPSRAW    xmm0, xmm0, 8+n    ; shift high bytes all the way to the bottom of the element
    VPSLLW    xmm0, xmm0, 8      ; high bytes back in place, with zero in the low byte.  (VPSLLDQ can't work: PSRAW 8+n leaves garbage we need to clear)

    VPSRLW    xmm1, xmm1, 8      ; shift low bytes into place, leaving zero in the high byte.  (VPSRLDQ 1 could do this, if we started with VPSLLW instead of VPSLLDQ)
    VPOR      xmm0, xmm0, xmm1
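Here is the shift-only sequence above written with SSE2 intrinsics, as a sketch with n fixed at 2. The function name `sra2_epi8_shifts` is made up for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Arithmetic right shift of each signed byte by 2, using only shifts and POR. */
static inline __m128i sra2_epi8_shifts(__m128i v)
{
    __m128i lo = _mm_slli_si128(v, 1);       /* pslldq 1: low bytes into the high position   */
    lo = _mm_srai_epi16(lo, 2);              /* psraw n: shifted low bytes, in the wrong place */
    lo = _mm_srli_epi16(lo, 8);              /* psrlw 8: into place, zero in the high byte    */

    __m128i hi = _mm_srai_epi16(v, 8 + 2);   /* psraw 8+n: high bytes down to the bottom      */
    hi = _mm_slli_epi16(hi, 8);              /* psllw 8: back in place, zero in the low byte  */

    return _mm_or_si128(hi, lo);             /* por */
}
```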

Using PAND/PANDN/POR with a constant (alternating 0/-1 bytes) in a register would also work (with far less pressure on the shift port) for doing a byte-blend, and is a better choice if you have to do this in a loop.
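A minimal sketch of that PAND/PANDN/POR byte-blend with SSE2 intrinsics, again with the shift count fixed at 2. The function name `sra2_epi8_blend` is hypothetical. The constant has -1 in the high byte of every word, so high bytes are taken from the plain PSRAW result and low bytes from the shifted-up copy.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Arithmetic right shift of each signed byte by 2, blending with and/andnot/or. */
static inline __m128i sra2_epi8_blend(__m128i v)
{
    const __m128i hi_mask = _mm_set1_epi16(-256);   /* 0xff00 per word: selects high bytes */

    /* Low bytes: move to the high position, then psraw 8+2 brings the result back down. */
    __m128i lo = _mm_srai_epi16(_mm_slli_epi16(v, 8), 8 + 2);
    /* High bytes: psraw 2 is already correct there; the low bytes hold garbage. */
    __m128i hi = _mm_srai_epi16(v, 2);

    /* (hi & mask) | (lo & ~mask).  _mm_andnot_si128(a, b) computes (~a) & b. */
    return _mm_or_si128(_mm_and_si128(hi_mask, hi),
                        _mm_andnot_si128(hi_mask, lo));
}
```

Compared with the shift-only version, this trades two shifts for two boolean ops plus a constant, which is why it tends to suit loops where the constant stays in a register.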

Here's another way to emulate "psrab" which works for SSE or AVX with 1 scratch register:

  __ punpckhbw(scratch, src);  // junk in low bytes
  __ punpcklbw(dst, src);      // junk in low bytes
  __ psraw(scratch, 8 + shift);
  __ psraw(dst, 8 + shift);
  __ packsswb(dst, scratch);   // pack words to get result
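For comparison, the same unpack / shift / pack idea can be sketched with SSE2 intrinsics. The function name `sra2_epi8_unpack` is hypothetical, and the shift count is fixed at 2. Here the "junk" in the low byte of each word is simply a second copy of the source byte, which the 8+shift arithmetic shift discards.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Arithmetic right shift of each signed byte by 2 via unpack, psraw, packsswb. */
static inline __m128i sra2_epi8_unpack(__m128i v)
{
    /* Each word gets a source byte in its high half; the low half is don't-care. */
    __m128i hi = _mm_unpackhi_epi8(v, v);    /* punpckhbw: bytes 8..15 */
    __m128i lo = _mm_unpacklo_epi8(v, v);    /* punpcklbw: bytes 0..7  */

    hi = _mm_srai_epi16(hi, 8 + 2);          /* sign-extended, shifted result per word */
    lo = _mm_srai_epi16(lo, 8 + 2);

    /* Results fit in a signed byte, so the saturation in packsswb never triggers. */
    return _mm_packs_epi16(lo, hi);
}
```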

Source: https://stackoverflow.com/questions/35002937/sse-simd-shift-with-one-byte-element-size-granularity
