问题
Given a register of 4 bytes (or 16 for SIMD), there has to be an efficient way to sort the bytes in-register with a few instructions.
Thanks in advance.
回答1:
Look up an efficient sorting network for N = the number of bytes you care about (4 or 16). Convert that to a sequence of compare and exchange instructions. (For N=16 that'll be more than 'a few', though.)
回答2:
Found it! It's in the 2007 paper "Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms" by Furtak, Amaral, and Niewiadomski. Section 4.
It uses 4 SSE registers, has 12 steps, and runs in 19 instructions including load and store.
The same paper has some excellent work on dynamically making sorting networks with SIMD.
回答3:
To speed up sorting of strings, I ended up packing 7 bytes per double and sorting (ranking) an array of 16 doubles in SSE2, using bitonic sort to create two runs of 8, and a binary merge to merge the two runs. You can see the first part here http://mischasan.wordpress.com/2011/07/29/okay-one-more-poke-at-sse2-sorting-doubles/ (asm) and here http://mischasan.wordpress.com/2011/09/02/update-on-bitonic-sse2-sort-of-16-doubles/ (C), and the bitonic merge step (if you want to go SSE all the way) here: http://mischasan.wordpress.com/2012/11/04/sse2-odd-even-merge-the-last-step-in-sorting/ . I replaced the insertion sort at the bottom of qsort with this sort, and it's about 5 times as fast as straight qsort. HTH
I hadn't seen the UofA paper; the bitonic logic is from old school (CTM) GPGPU programming.
Sorry about the embedded link strings; I don't know how to add clickable links in comments stackoverflow.
回答4:
All sorting algorithms require "swapping" values from one place to another. Since you're talking about a literal CPU register, that means any sort would need another register to use as a temporary place to hold the bytes being swapped.
I've never seen a chip with a built-in method for sorting bytes within a register. Not saying it hasn't been done, but I can't think of many uses for such an instruction.
来源:https://stackoverflow.com/questions/1580686/fast-in-register-sort-of-bytes