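Here's another way to emulate "psrab" (a per-byte arithmetic right shift, which x86 SIMD doesn't provide) that works for both SSE and AVX and needs only one scratch register. It assumes shift is an immediate in the range 0..7: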
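__ punpckhbw(scratch, src); // high 8 src bytes -> high byte of each word; junk in low bytes
__ punpcklbw(dst, src); // low 8 src bytes -> high byte of each word; junk in low bytes
__ psraw(scratch, 8 + shift); // arithmetic shift drops the junk and sign-extends
__ psraw(dst, 8 + shift);
__ packsswb(dst, scratch); // pack words to get result; values fit in int8, so no saturation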
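
To sanity-check the sequence above, here is a rough scalar model of what happens to a single byte lane. This is only a sketch: the helper names (UnpackLane, PackSaturate, EmulatedShiftRightByte, CheckAgainstReference) are made up for illustration and are not V8 APIs, and it assumes `>>` on a negative signed value is an arithmetic shift (guaranteed since C++20, and true on the usual compilers before that).

```cpp
#include <cassert>
#include <cstdint>

// One 16-bit lane after punpck{l,h}bw(dst, src): the src byte ends up in the
// high byte, and whatever was in dst/scratch (junk) ends up in the low byte.
inline int16_t UnpackLane(uint8_t src_byte, uint8_t junk) {
  return static_cast<int16_t>(static_cast<uint16_t>((src_byte << 8) | junk));
}

// One lane of packsswb: narrow a signed word to a signed byte with saturation.
inline int8_t PackSaturate(int16_t w) {
  if (w > INT8_MAX) return INT8_MAX;
  if (w < INT8_MIN) return INT8_MIN;
  return static_cast<int8_t>(w);
}

// Hypothetical scalar stand-in for the unpack / psraw / packsswb sequence.
// shift is expected to be in 0..7.
inline int8_t EmulatedShiftRightByte(int8_t value, int shift, uint8_t junk) {
  int16_t word = UnpackLane(static_cast<uint8_t>(value), junk);
  // psraw by 8 + shift: the junk low byte is shifted out entirely and the
  // sign bit of the original byte is replicated into the vacated bits.
  int16_t shifted = static_cast<int16_t>(word >> (8 + shift));
  return PackSaturate(shifted);
}

// Compare against a plain per-byte arithmetic shift for every byte value,
// every shift in 0..7, and a few different junk bytes.
inline void CheckAgainstReference() {
  for (int shift = 0; shift <= 7; ++shift) {
    for (int v = -128; v <= 127; ++v) {
      for (int junk = 0; junk <= 255; junk += 85) {  // 0, 85, 170, 255
        int8_t expected = static_cast<int8_t>(v >> shift);
        int8_t actual = EmulatedShiftRightByte(static_cast<int8_t>(v), shift,
                                               static_cast<uint8_t>(junk));
        assert(actual == expected);
        (void)expected;
        (void)actual;
      }
    }
  }
}
```

Calling CheckAgainstReference() covers the full input space for one lane. The result is independent of the junk byte, and after shifting by at least 8 every word is already in [-128, 127], so the signed saturation in the pack step never kicks in.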