Question
Let's say I want to reverse the byte order of a very large byte array. I can do this the slow way using the main registers but I would like to speed it up using the XMM or YMM registers.
Is there a way to reverse the byte order in an XMM or YMM register?
Answer 1:
Yes, use SSSE3 _mm_shuffle_epi8 or AVX2 _mm256_shuffle_epi8 to shuffle bytes within 16-byte AVX2 "lanes". Depending on the shuffle control vector, you can swap pairs of bytes, reverse 4-byte units, or reverse 8-byte units. Or reverse all 16 bytes.
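For illustration, here is a minimal sketch of what those shuffle control vectors look like with SSE intrinsics (not from the original post; the wrapper names are made up). Each byte of the pshufb control vector selects which source byte lands in that destination byte, and _mm_set_epi8 takes its arguments from byte 15 down to byte 0:

#include <immintrin.h>

static inline __m128i swap_bytes_in_epi16(__m128i v) {   // swap pairs of bytes
    return _mm_shuffle_epi8(v, _mm_set_epi8(14, 15, 12, 13, 10, 11, 8, 9,
                                            6, 7, 4, 5, 2, 3, 0, 1));
}
static inline __m128i swap_bytes_in_epi32(__m128i v) {   // reverse 4-byte units
    return _mm_shuffle_epi8(v, _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11,
                                            4, 5, 6, 7, 0, 1, 2, 3));
}
static inline __m128i swap_bytes_in_epi64(__m128i v) {   // reverse 8-byte units
    return _mm_shuffle_epi8(v, _mm_set_epi8(8, 9, 10, 11, 12, 13, 14, 15,
                                            0, 1, 2, 3, 4, 5, 6, 7));
}
static inline __m128i reverse_all_16_bytes(__m128i v) {  // reverse the whole vector
    return _mm_shuffle_epi8(v, _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                            8, 9, 10, 11, 12, 13, 14, 15));
}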
But vpshufb isn't lane-crossing, so you can't reverse 32 bytes with one instruction until AVX512VBMI vpermb. vpshufb ymm does 2x 16-byte shuffles in the two 128-bit lanes of the YMM vector.
So if you're byte-reversing an entire array, rather than the endianness / byte-order of individual elements in an array, you have 3 options:
- Stick to 128-bit vectors (simple and portable, and probably not slower on current CPUs). Only needs 16-byte alignment for best performance. (A sketch of this option follows the list.)
- Load with vmovdqu / vinserti128, then vpshufb, then a 32-byte store. (Or do 32-byte loads and split 16-byte stores, but that's probably not as good.) Vectorize random init and print for BigInt with decimal digit array, with AVX2? includes a cache-blocked byte-array reverse into a tmp buffer to feed fwrite in 8kiB chunks.
- Use vpermq to lane-swap before or after vpshufb (not great on AMD, and bottlenecks on 1-per-clock shuffle throughput on current Intel). But potentially very good on Ice Lake (2 shuffle ports).
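Here is the sketch referenced in the first bullet: a rough in-place whole-array reversal using only 128-bit SSSE3 vectors, swapping one 16-byte block from each end per iteration. This is an illustrative assumption of how that option could look, not code from the original post; the function name and scalar cleanup are made up:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static void reverse_bytes_inplace(uint8_t *buf, size_t len)
{
    const __m128i rev = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                     8, 9, 10, 11, 12, 13, 14, 15);
    size_t lo = 0, hi = len;
    while (hi - lo >= 32) {                 // one 16-byte block from each end
        __m128i head = _mm_loadu_si128((const __m128i *)(buf + lo));
        __m128i tail = _mm_loadu_si128((const __m128i *)(buf + hi - 16));
        head = _mm_shuffle_epi8(head, rev); // reverse each block
        tail = _mm_shuffle_epi8(tail, rev);
        _mm_storeu_si128((__m128i *)(buf + lo), tail);      // reversed tail to the front
        _mm_storeu_si128((__m128i *)(buf + hi - 16), head); // reversed head to the back
        lo += 16;
        hi -= 16;
    }
    while (hi - lo >= 2) {                  // scalar swap for the <32-byte middle
        uint8_t t = buf[lo];
        buf[lo++] = buf[hi - 1];
        buf[--hi] = t;
    }
}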
vpshufb ymm is a single-uop instruction on Intel, or 2 uops on AMD, and processes 32 bytes of data at once.
For very large inputs, it's probably worth it to reach a 32 or 64-byte alignment boundary before your vectorized loop, so none of the loads/stores cross cache-line boundaries. (For small inputs the minor benefit isn't worth the extra prologue/epilogue code and branching.)
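One possible shape for that prologue (an assumption on my part, not spelled out above): peel off leading bytes with scalar code until the destination is 32-byte aligned, then start the vector loop there. Sketch for a reversed copy (dst[i] = src[len-1-i]); the helper name is hypothetical:

#include <stddef.h>
#include <stdint.h>

// Returns how many head bytes were handled scalarly, so that dst + head
// is 32-byte aligned for the main SIMD loop's stores.
static size_t peel_unaligned_head(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t head = (size_t)(-(uintptr_t)dst & 31);  // bytes until the next 32-byte boundary
    if (head > len)
        head = len;
    for (size_t i = 0; i < head; i++)
        dst[i] = src[len - 1 - i];
    return head;   // caller runs the 32-byte loop starting at dst + head
}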
But potentially even better is to only swap a 16kiB chunk before you use it, so it's still hot in L1d cache when the next step reads it. This is called cache blocking. Or maybe use 128kiB chunks to block for L2 cache size.
You might swap in chunks as you read the data from a file. e.g. do read() system calls in chunks of 64k or 128k and swap the result while it's still hot in cache after the kernel copied data from the pagecache into your user-space buffer. Or use mmap to memory-map a file, and run a copy-and-swap loop from that. (Or for a private mapping, in-place swap; but that will trigger copy-on-write anyway so not much benefit. And file-backed mmap on Linux can't use anonymous hugepages).
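A hedged sketch of that chunked read-and-swap pattern (chunk size, names, and error handling are illustrative; the per-chunk swap here is a plain scalar 8-byte byteswap standing in for one of the SIMD loops above):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (128 * 1024)   // block for L2 cache size; 64k is also reasonable

static void byte_swap_chunk(uint8_t *p, size_t n)   // placeholder: per-uint64 bswap
{
    for (size_t i = 0; i + 8 <= n; i += 8) {        // assumes n is a multiple of 8
        uint64_t v;
        memcpy(&v, p + i, 8);
        v = __builtin_bswap64(v);                   // GCC/Clang builtin
        memcpy(p + i, &v, 8);
    }
}

static ssize_t read_and_swap(int fd, uint8_t *dst, size_t total)
{
    size_t done = 0;
    while (done < total) {
        size_t want = total - done < CHUNK ? total - done : (size_t)CHUNK;
        ssize_t got = read(fd, dst + done, want);
        if (got <= 0)
            break;                                  // EOF or error (real code: retry on EINTR)
        byte_swap_chunk(dst + done, (size_t)got);   // swap while still hot in cache
        done += (size_t)got;
    }
    return (ssize_t)done;
}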
Another option is to simply swap on the fly if you only read the data a couple times; if the later uses are still memory bound, or have room for a shuffle uop without bottlenecking, it probably won't slow them down to shuffle on the fly.
A pass that touches all your data and only byte-swaps it has very poor computational intensity; you want to be doing more things with your data while it's in registers, or at least while it's hot in cache. But if you only byte-swap once and then read the data many times, or in a random access pattern, or from another language like Python or JavaScript that can't efficiently swap on the fly, then sure do a swap pass.
Or a swap pass is helpful if you will make multiple passes over it that aren't memory-bound, and an extra shuffle would slow down each later pass. In that case you do want to cache-block the swapping so the later pass's input is hot in cache.
The scalar option, bswap, is limited to at best 8 bytes per clock cycle, and every 8 bytes needs a separate load and store instruction. (movbe to load from memory with byte-swapping saves an instruction, but on mainstream CPUs doesn't micro-fuse into a single load+swap uop. On Silvermont it's single-uop, though.) Also, Intel bswap r64 is 2 uops, so it's not great.
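For comparison, the scalar version is just a load, bswap, and store per 8 bytes; something like this (GCC/Clang builtin shown; MSVC has _byteswap_uint64):

#include <stddef.h>
#include <stdint.h>

// Scalar per-element byteswap: at best 8 bytes per swap, plus a separate
// load and store each iteration.
static void bswap64_array_scalar(uint64_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p[i] = __builtin_bswap64(p[i]);   // typically compiles to a bswap r64
}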
This might saturate single-threaded memory bandwidth on modern CPUs with some loop unrolling, but SIMD with fewer total uops to process the same data lets out-of-order execution "see" farther ahead and start processing TLB misses for upcoming pages sooner, for example. HW data prefetch and TLB prefetch do help a lot, but it's typically at least slightly better to use wider loads/stores for memcpy.
(vpshufb is cheap enough that this will still basically perform like memcpy. Or better if rewriting in place.)
And of course if you ever have any cache hits, even just L3 cache, SIMD will really shine.
Answer 2:
I can't compete with the legendary Peter Cordes... but I want to show a C implementation.
Here are examples of reversing byte order using C intrinsics (they can be used to byte-reverse an entire array).
There are 3 code samples.
- Using SSE2 instruction set.
- Using SSSE3 instruction set.
- Using AVX2 instruction set.
//Include <immintrin.h> for the SSE2/SSSE3/AVX2 intrinsics used below.
#include <immintrin.h>

//Initialize XMM register with uint8 values 0 to 15 (for testing):
__m128i a_F_E_D_C_B_A_9_8_7_6_5_4_3_2_1_0 = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
//SSE2:
//Advantage: No need to build a shuffle mask (efficient for very short loops).
//////////////////////////////////////////////////////////////////////////
//Reverse order of uint32:
__m128i a_3_2_1_0_7_6_5_4_B_A_9_8_F_E_D_C = _mm_shuffle_epi32(a_F_E_D_C_B_A_9_8_7_6_5_4_3_2_1_0, _MM_SHUFFLE(0, 1, 2, 3));
//Swap pairs of uint16:
__m128i a_1_0_3_2_5_4_7_6_9_8_B_A_D_C_F_E = _mm_shufflehi_epi16(_mm_shufflelo_epi16(a_3_2_1_0_7_6_5_4_B_A_9_8_F_E_D_C, _MM_SHUFFLE(2, 3, 0, 1)), _MM_SHUFFLE(2, 3, 0, 1));
//Swap pairs of uint8:
__m128i a_0_1_2_3_4_5_6_7_8_9_A_B_C_D_E_F = _mm_or_si128(_mm_slli_epi16(a_1_0_3_2_5_4_7_6_9_8_B_A_D_C_F_E, 8), _mm_srli_epi16(a_1_0_3_2_5_4_7_6_9_8_B_A_D_C_F_E, 8));
//////////////////////////////////////////////////////////////////////////
//SSSE3:
//Advantage: Does not require AVX2 support
//////////////////////////////////////////////////////////////////////////
//Build shuffle mask
const __m128i shuffle_mask = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
a_0_1_2_3_4_5_6_7_8_9_A_B_C_D_E_F = _mm_shuffle_epi8(a_F_E_D_C_B_A_9_8_7_6_5_4_3_2_1_0, shuffle_mask);
//////////////////////////////////////////////////////////////////////////
//AVX2:
//Advantage: Potentially faster than SSSE3
//////////////////////////////////////////////////////////////////////////
//Initialize YMM register with uint8 values 0 to 31 (for testing):
__m256i a__31_to_0 = _mm256_set_epi8(31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
//Build shuffle mask
const __m256i shuffle_mask2 = _mm256_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
//Reverse byte order within the upper lane and lower lane of the YMM register.
__m256i a__16_to_31__0_to_15 = _mm256_shuffle_epi8(a__31_to_0, shuffle_mask2);
//Swap upper and lower lane of YMM register
__m256i a__0_to_31 = _mm256_permute4x64_epi64(a__16_to_31__0_to_15, _MM_SHUFFLE(1, 0, 3, 2));
//////////////////////////////////////////////////////////////////////////
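For completeness, here is one way the AVX2 pieces above could be combined into a full-array reverse. This is an illustrative sketch, not part of the original answer; it assumes dst and src do not overlap, and the function name is made up:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static void reverse_bytes_avx2(uint8_t *dst, const uint8_t *src, size_t len)
{
    const __m256i rev_in_lane = _mm256_set_epi8(
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    size_t i = 0;
    for (; i + 32 <= len; i += 32) {
        //Load the 32-byte block that ends where the previous iteration's began:
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + len - 32 - i));
        v = _mm256_shuffle_epi8(v, rev_in_lane);                  //reverse within each 128-bit lane
        v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1, 0, 3, 2)); //swap the two lanes
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
    for (; i < len; i++)                                          //scalar tail for the last <32 bytes
        dst[i] = src[len - 1 - i];
}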
Source: https://stackoverflow.com/questions/56407741/reverse-byte-order-in-xmm-or-ymm-register