You can fully unroll your function like this. It may well be faster than what your compiler generates, but benchmark it to be sure.
// rax holds the 64-bit input
xor rcx, rcx //Clear the addend so that adc adds only the carry
add rax, rax //Shift bit 63 into the carry flag
adc dword ptr [@bit_counter + 63 * 4], ecx //Add the carry to counter[63]
add rax, rax //Shift bit 62 into the carry flag
adc dword ptr [@bit_counter + 62 * 4], ecx //Add the carry to counter[62]
add rax, rax //Shift bit 61 into the carry flag
adc dword ptr [@bit_counter + 61 * 4], ecx //Add the carry to counter[61]
// ...
add rax, rax //Shift bit 1 into the carry flag
adc dword ptr [@bit_counter + 1 * 4], ecx //Add the carry to counter[1]
add rax, rax //Shift bit 0 into the carry flag
adc dword ptr [@bit_counter], ecx //Add the carry to counter[0]
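To check the unrolled sequence against something simpler, here is a minimal C sketch of what it computes. The name `bit_counter` is borrowed from the asm label and `count_bits` is made up for illustration; the sketch assumes the counters are 64 consecutive 32-bit values, as the `* 4` scaling in the asm suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 64 dword counters, one per bit position. */
static uint32_t bit_counter[64];
static_assert(sizeof(bit_counter) == 64 * 4, "counters must be 64 dwords");

/* Portable model of the add/adc chain: walk the bits from 63 down to 0,
   shift each one out of the top of the word and add it to its counter. */
static void count_bits(uint64_t value)
{
    for (int bit = 63; bit >= 0; --bit) {
        uint32_t carry = (uint32_t)(value >> 63); /* what `add rax, rax` leaves in CF */
        value <<= 1;                              /* the doubling itself */
        bit_counter[bit] += carry;                /* `adc [counter], 0` */
    }
}
```

Calling `count_bits` once per input word should leave the same totals in `bit_counter` as the assembly does, which makes it a handy baseline for both correctness tests and benchmarks.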
EDIT:
You can also try updating two counters with a single 64-bit add, like this:
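// rax holds the 64-bit input
//
xor rcx, rcx //Clear the addend (must come before the add, because xor clears the carry flag)
add rax, rax //Shift bit 63 into the carry flag
rcl rcx, 33 //Rotate the carry into bit 32 of rcx, i.e. bit 0 of the upper dword
add rax, rax //Shift bit 62 into the carry flag
adc qword ptr [@bit_counter + 62 * 4], rcx //Add rcx + carry: updates counter[63] and counter[62]
xor rcx, rcx //Clear the stale bit 32, which rcl would otherwise rotate into bit 0
add rax, rax //Shift bit 61 into the carry flag
rcl rcx, 33 //Rotate the carry into bit 32 of rcx, i.e. bit 0 of the upper dword
add rax, rax //Shift bit 60 into the carry flag
adc qword ptr [@bit_counter + 60 * 4], rcx //Add rcx + carry: updates counter[61] and counter[60]
//...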
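The paired update can be modelled in C as well. This is only a sketch under stated assumptions: the names `paired_counter` and `count_bits_paired` are hypothetical, the counters are taken to be 64 consecutive 32-bit values, and the machine is little-endian (as x86 is), so the even-numbered counter is the low half of each 64-bit pair.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: 64 dword counters, accessed two at a time as one qword. */
static uint32_t paired_counter[64];
static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t), "one qword covers two counters");

static void count_bits_paired(uint64_t value)
{
    for (int bit = 62; bit >= 0; bit -= 2) {
        uint64_t hi = (value >> (bit + 1)) & 1; /* destined for counter[bit + 1] */
        uint64_t lo = (value >> bit) & 1;       /* destined for counter[bit] */
        uint64_t addend = (hi << 32) | lo;      /* the value that rcx plus CF adds up to */

        uint64_t pair;
        memcpy(&pair, &paired_counter[bit], sizeof pair); /* load both counters */
        pair += addend;                                   /* one 64-bit add */
        memcpy(&paired_counter[bit], &pair, sizeof pair); /* store both counters */
    }
}
```

One consequence of sharing an add: if an even-numbered counter ever wraps past 2^32 - 1, the carry spills into its odd-numbered neighbour. That is harmless as long as no counter gets anywhere near that limit.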