I want to pack the bits based on arbitrary mask

旧巷老猫 提交于 2019-11-29 11:29:03

If you're targeting x86 most compilers will have an instrinsic for the pdep (parallel bit deposit) instruction which directly performs the operation you want, in hardware, at a rate of 1 per cycle (3 cycles latency)1, on Intel hardware that supports it. For example, gcc offers it as the _pdep_u32 and _pdep_u64 intrinsic functions.

Unfortunately, on AMD Ryzen (the only AMD hardware that supports BMI2) this operation is very slow: one per 18 cycles. You might want to have a separate code-path to support non-Intel platforms if they are important to you.

If you aren't on x86, you can find general purpose implementations of these options here - the specific operation you want is expand_right - and this other section will probably be of great interest in that it specifically covers the simple case where you are dealing with word-sized elements.

In practice, if you are really dealing with 8-bit data and mask values, you might just use a precomputed lookup table - either a big 8 bit x 8 bit = 65k one which covers all {data, mask} combinations and which gives you the answer directly, or a 256-entry one which covers all mask values and gives you some coefficients for a simple bit-shifting calculation or a multiplication-based code.

FWIW, I'm not sure how you can do it easily with 5 rotate instructions, because it seems that the naive solution needs 1 rotate instruction for each bit, whether set or not (so for a word size of 8 bits, 7 or 8 rotate2 instructions).


1 Of course, the performance in principle depends on the hardware, but on all the mainstream Intel CPUs that implement it, it's 1 cycle throughput, 3 cycles latency (not sure about AMD).

2 Only 7 rotates because the "rotate of 0" operation for the lowest order bit can evidently be omitted.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!