So I had an interview question before regarding bit manipulation. The company is a well known GPU company. I had very little background in assembly language (weird despite b
If this code is fast or not depends on the processor. For sure it will be not very fast on Cortex-A8 but may run very fast on Cortex-A9 and newer CPU.
It is however a very short solution.
Expects input in r0, and returns output in r0
vmov.32 d0[0], r0
vcnt.8 d0, d0
vmov.32 r0, d0[0]
add r0, r0, r0, lsr #16
add r0, r0, r0, lsr #8
and r0, r0, #31
The main work is done in the vcnt.8 instruction which counts the bits of each byte in a NEON register and stores the bitcount back into the bytes of D0.
There is no vcnt.32 form, only .8, so you need to horizontally add the 4 bytes together, which is what the rest of the code is doing.