So I had an interview question before regarding bit manipulation. The company is a well known GPU company. I had very little background in assembly language (weird despite b
Since this is tagged ARM, the clz instruction is most helpful. The problem is also described as a population count. gcc has __builtin_popcount() for this. As does the ARM tools. There is this link (don't feel bad about your solution, some one made a web page with nearly the same) and also there is Dave Seal's version with six instruction for non-clz ARMs. The clz is advantageous and can be used to produce a faster algorithm, depending on the input.
As well as auselen's good reading suggestion,Hacker's Delight this bit twiddling blog maybe helpful which is talking about such things in a graphic context. At least I found it useful to understand some of Qt's blitting code. However, it has some usefulness in coding a population count routine.
The carry add unit is useful in a divide and conquer sense, making the problem O(ln n). clz is more useful if the data has runs of ones or zeros.
The Hacker's Delight entry has more background on Dave Seal's ARM code.