SSE provides the shift instructions PSLL(W/D/Q) and PSRL(W/D/Q), but there is no PSLLB instruction. So how can we shift vectors of 8-bit values (single bytes)?
In the special case of a left shift by one, you can use paddb xmm0, xmm0.
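As a rough illustration in C intrinsics (the function name sll1_bytes is made up for this sketch), adding a vector to itself with byte-wise wraparound doubles each byte, which is the same as shifting each byte left by one:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Per-byte left shift by one: paddb v, v.
   The carry out of each byte is discarded, so nothing leaks into the neighbor. */
static __m128i sll1_bytes(__m128i v)
{
    return _mm_add_epi8(v, v);
}
```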
As Jester points out in comments, the best option to emulate the non-existent psrlb and psllb is to use a wider shift and then mask off any bits that crossed element boundaries, e.g.:
psrlw xmm0, 2 ; doesn't matter what size (w/d/q): performance is the same for all sizes on all CPUs
pand xmm0, [mask_right2]
section .rodata
align 16
;; required mask depends on the shift count
mask_right2: times 16 db 0xff >> 2 ; 16 bytes of 0x3f
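The same shift-then-mask idea, sketched with SSE2 intrinsics in C. The function names are invented for this example, and the shift count is fixed at 2 to mirror the asm above; the left-shift variant uses the mirrored mask 0xff << 2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Logical right shift of each of the 16 bytes by 2: psrlw + pand. */
static __m128i srl2_bytes(__m128i v)
{
    __m128i wide = _mm_srli_epi16(v, 2);              /* psrlw xmm, 2 */
    /* 0x3f clears the two bits that arrived from the neighboring byte */
    return _mm_and_si128(wide, _mm_set1_epi8(0x3F));  /* pand */
}

/* Left shift of each byte by 2: psllw + pand with 0xfc. */
static __m128i sll2_bytes(__m128i v)
{
    __m128i wide = _mm_slli_epi16(v, 2);              /* psllw xmm, 2 */
    return _mm_and_si128(wide, _mm_set1_epi8(-4));    /* -4 is 0xfc as a byte */
}
```

A compiler will normally turn the _mm_set1_epi8 constants into a 16-byte load or a broadcast, hoisted out of any loop.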
Or broadcast 0x3f into a vector register ahead of a loop some other way, like vpbroadcastd or vbroadcastss from a dword in memory, SSE3 movddup from a qword, or just a movdqa vector load. (vpbroadcastb takes an extra ALU uop, unlike dword or wider broadcasts which are just simple loads.) Or generate it on the fly with a sequence like pcmpeqd xmm0,xmm0 / psrlw xmm0, 8+2 / packuswb xmm0,xmm0. With the right choice of shift count, you can generate any byte pattern of the form 2^n-1 (repeated zeros and then repeated ones).
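Here is a minimal C sketch of the on-the-fly sequence, assuming SSE2. The helper name low_bits_mask_bytes and its parameter n are made up for illustration; n = 2 reproduces the 0x3f mask.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Build 16 bytes of (0xff >> n) with no memory constant, for n in 0..8. */
static __m128i low_bits_mask_bytes(int n)
{
    assert(n >= 0 && n <= 8);
    __m128i zero  = _mm_setzero_si128();
    __m128i ones  = _mm_cmpeq_epi32(zero, zero);                    /* pcmpeqd: all bits set */
    __m128i words = _mm_srl_epi16(ones, _mm_cvtsi32_si128(8 + n));  /* psrlw: each word = 0x00ff >> n */
    return _mm_packus_epi16(words, words);                          /* packuswb: words are 0..255, so no saturation */
}
```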
mov r32, imm32 / movd xmm, r32 and a shuffle is also an option, but probably won't save instruction bytes compared to the pcmpeqw / ... sequence. (Note that the register-source version of VBROADCASTSS is AVX2-only, which doesn't matter here since 256b integer shifts are also AVX2-only.)
For a variable-count vector shift, creating the mask in an integer register and broadcasting it to a vector is one option (use pshufb with an all-zero register to broadcast the low byte, or use imul eax, eax, 0x01010101 to go from a byte to a dword for movd + pshufd). You could also use the pcmpeqd method to create an all-ones vector, shift it with psrlw xmm0, xmm1, and then pack or pshufb.
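A sketch of the integer-register route for a run-time count, assuming SSE2 only. The function name srl_bytes_var is hypothetical, and the multiply by 0x01010101 stands in for the imul mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Logical right shift of each byte by a run-time count n (0..7). */
static __m128i srl_bytes_var(__m128i v, int n)
{
    assert(n >= 0 && n <= 7);
    /* Replicate the byte mask into all four bytes of a dword (imul by 0x01010101). */
    uint32_t m32 = (0xFFu >> n) * 0x01010101u;
    /* movd + pshufd with selector 0 broadcasts the dword to the whole vector. */
    __m128i mask = _mm_shuffle_epi32(_mm_cvtsi32_si128((int)m32), 0);
    /* psrlw with the count in the low qword of a vector register. */
    __m128i wide = _mm_srl_epi16(v, _mm_cvtsi32_si128(n));
    return _mm_and_si128(wide, mask);
}
```

In a loop with a loop-invariant count, both the mask and the count vector would be set up once outside the loop.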
I don't see any similarly efficient way to emulate arithmetic right-shift (the non-existent PSRAB). The high byte of each word is handled correctly by PSRAW. Shifting the low byte of each word to the high position would let another PSRAW copy its sign bit as many times as required.
; input in xmm0. Using AVX to save on mov instructions
VPSLLDQ xmm1, xmm0, 1 ; or VPSLLW xmm1, xmm0, 8, but this distributes one of the uops to the shuffle port
VPSRAW xmm1, xmm1, 8+2 ; shift low bytes back to final destination
VPSRAW xmm0, xmm0, 2 ; shift high bytes, leaving garbage in low bytes
VPBLENDVB xmm0, xmm1, xmm0, xmm2 ; (where xmm2 holds a mask of alternating 0 and -1 bytes, i.e. -1 in the high byte of each word so those bytes are taken from xmm0; it could be generated with pcmpeqw / psllw 8). This insn is fairly slow
There is no immediate-blend with byte granularity, because a single immediate byte can only encode 8 elements.
Without VPBLENDVB (possibly better even when it's available, if generating or loading a constant for it is slow):
VPSLLDQ xmm1, xmm0, 1 ; or VPSLLW 8
VPSRAW xmm1, xmm1, n ; low bytes in the wrong place
VPSRAW xmm0, xmm0, 8+n ; shift high bytes all the way to the bottom of the element
VPSLLW xmm0, xmm0, 8 ; high bytes back in place, with zero in the low byte. (VPSLLDQ can't work: PSRAW 8+n leaves garbage we need to clear)
VPSRLW xmm1, xmm1, 8 ; shift low bytes into place, leaving zero in the high byte. (VPSRLDQ 1 could do this, if we started with VPSLLW instead of VPSLLDQ)
VPOR xmm0, xmm0, xmm1
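The shift-and-OR sequence above can be written with SSE2 intrinsics roughly as follows. This is a sketch, not a tuned implementation: the name sra_bytes_shifts is invented, and it uses the psllw 8 alternative for the first step.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Arithmetic right shift of each signed byte by n (0..7), using only shifts and por. */
static __m128i sra_bytes_shifts(__m128i v, int n)
{
    assert(n >= 0 && n <= 7);
    __m128i lo = _mm_slli_epi16(v, 8);                        /* psllw 8: low bytes to the high position */
    lo = _mm_sra_epi16(lo, _mm_cvtsi32_si128(n));             /* psraw n: correct result, but in the high byte */
    lo = _mm_srli_epi16(lo, 8);                               /* psrlw 8: into place, high byte zeroed */
    __m128i hi = _mm_sra_epi16(v, _mm_cvtsi32_si128(8 + n));  /* psraw 8+n: high bytes to the bottom of the word */
    hi = _mm_slli_epi16(hi, 8);                               /* psllw 8: back in place, low byte zeroed */
    return _mm_or_si128(hi, lo);                              /* por */
}
```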
Using PAND/PANDN/POR with a constant (alternating 0/-1 bytes) in a register would also work (with far less pressure on the shift port) for doing a byte-blend, and is a better choice if you have to do this in a loop.
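For comparison, a sketch of the PAND/PANDN/POR byte-blend in C with SSE2 intrinsics. The function name and the hi_mask variable are made up here; in a real loop the mask would live in a register across iterations.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Arithmetic right shift of each signed byte by n (0..7), blending with a constant mask. */
static __m128i sra_bytes_blend(__m128i v, int n)
{
    assert(n >= 0 && n <= 7);
    const __m128i hi_mask = _mm_set1_epi16(-256);  /* 0xff00 words: bytes alternate 0, -1 */
    /* Low bytes: move to the high position, then psraw 8+n brings the result back down. */
    __m128i lo = _mm_sra_epi16(_mm_slli_epi16(v, 8), _mm_cvtsi32_si128(8 + n));
    /* High bytes: plain psraw n is already correct; the low byte of each word is garbage. */
    __m128i hi = _mm_sra_epi16(v, _mm_cvtsi32_si128(n));
    hi = _mm_and_si128(hi, hi_mask);               /* pand: keep high bytes */
    lo = _mm_andnot_si128(hi_mask, lo);            /* pandn: (~hi_mask) & lo keeps low bytes */
    return _mm_or_si128(hi, lo);                   /* por */
}
```

This version needs three shifts instead of five, at the cost of keeping one constant around.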
Here's another way to emulate "psrab" which works for SSE or AVX with 1 scratch register:
__ punpckhbw(scratch, src); // junk in low bytes
__ punpcklbw(dst, src); // junk in low bytes
__ psraw(scratch, 8 + shift);
__ psraw(dst, 8 + shift);
__ packsswb(dst, scratch); // pack words to get result
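The same unpack/shift/pack idea as a C sketch with SSE2 intrinsics. The function name is invented, and unlike the snippet above it unpacks the source with itself, so the low byte of each word holds a copy of the input byte rather than junk; either way that byte is shifted out.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Arithmetic right shift of each signed byte by n (0..7) via unpack, psraw, pack. */
static __m128i sra_bytes_unpack(__m128i v, int n)
{
    assert(n >= 0 && n <= 7);
    __m128i count = _mm_cvtsi32_si128(8 + n);
    __m128i lo = _mm_unpacklo_epi8(v, v);  /* punpcklbw: source byte ends up in the high byte of each word */
    __m128i hi = _mm_unpackhi_epi8(v, v);  /* punpckhbw: same for the upper 8 bytes */
    lo = _mm_sra_epi16(lo, count);         /* psraw 8+n: sign-extended result in each word */
    hi = _mm_sra_epi16(hi, count);
    return _mm_packs_epi16(lo, hi);        /* packsswb: values fit in -128..127, so no saturation */
}
```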
Source: https://stackoverflow.com/questions/35002937/sse-simd-shift-with-one-byte-element-size-granularity