You could use the set of counters, each of different size. Firstly accumulate 3 values in 2-bit counters, then unpack them and update 4-bit counters. When 15 values are ready, unpack to byte-sized counters, and after 255 values update bit_counter[].
All this work may be done in parallel in 128 bit SSE registers. On modern processors only one instruction is needed to unpack 1 bit to 2. Just multiply source quadword to itself with PCLMULQDQ instruction. This will interleave source bits with zeros. The same trick may help to unpack 2 bits to 4. And unpacking of 4 and 8 bits may be done with shuffles, unpacks and simple logical operations.
Average performance seems to be good, but the price is 120 bytes for additional counters and quite a lot of assembly code.