Modern CPUs can perform extended multiplication between two native-size words and store the low and high halves of the result in separate registers. Similarly, when performing division,
With gcc 4.6 or later you can use __int128. This works on most 64-bit hardware.
For instance, to get the 128-bit result of a 64x64-bit multiplication, just use
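#include <stddef.h>

/* unsigned: a signed __int128 product would overflow for results >= 2^127 */
void extmul(size_t a, size_t b, size_t *lo, size_t *hi) {
    unsigned __int128 result = (unsigned __int128)a * (unsigned __int128)b;
    *lo = (size_t)result;         /* low 64 bits */
    *hi = (size_t)(result >> 64); /* high 64 bits */
}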
On x86_64 gcc is smart enough to compile this to
0: 48 89 f8 mov %rdi,%rax
3: 49 89 d0 mov %rdx,%r8
6: 48 f7 e6 mul %rsi
9: 49 89 00 mov %rax,(%r8)
c: 48 89 11 mov %rdx,(%rcx)
f: c3 retq
No native 128 bit support or similar required, and after inlining only the mul instruction remains.
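A quick way to convince yourself that the high word comes out right is a small self-check. This is only a sketch, assuming a 64-bit gcc or clang target where __int128 is available; the names extmul64 and extmul64_check are made up for illustration, and uint64_t is used in place of size_t so the word size is explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant of extmul with explicit 64-bit types.
   The unsigned 128-bit type keeps products >= 2^127 well defined. */
void extmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi) {
    unsigned __int128 result = (unsigned __int128)a * b;
    *lo = (uint64_t)result;          /* low 64 bits */
    *hi = (uint64_t)(result >> 64);  /* high 64 bits */
}

/* Hypothetical self-check with two products that are easy to verify by hand. */
void extmul64_check(void) {
    uint64_t lo, hi;

    /* (2^64 - 1)^2 = 2^128 - 2^65 + 1, so hi = 2^64 - 2 and lo = 1 */
    extmul64(UINT64_MAX, UINT64_MAX, &lo, &hi);
    assert(lo == 1 && hi == UINT64_MAX - 1);

    /* 2^32 * 2^32 = 2^64, so hi = 1 and lo = 0 */
    extmul64(UINT64_C(1) << 32, UINT64_C(1) << 32, &lo, &hi);
    assert(lo == 0 && hi == 1);
}
```

Calling extmul64_check() from main should return silently if the compiler handles the 128-bit product as expected.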
Edit: On a 32-bit architecture this works in a similar way: replace the 128-bit type with uint64_t and the shift width with 32. The optimization works even with older gcc versions.
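The 32-bit variant described in the edit might look roughly like this. It is a sketch rather than tested production code: the names extmul32 and extmul32_check are hypothetical, and uint32_t stands in for size_t so that the code behaves the same regardless of the target's word size.

```c
#include <assert.h>
#include <stdint.h>

/* 32x32 -> 64 bit multiply: uint64_t takes the role of the 128-bit type. */
void extmul32(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi) {
    uint64_t result = (uint64_t)a * (uint64_t)b;
    *lo = (uint32_t)result;          /* low 32 bits */
    *hi = (uint32_t)(result >> 32);  /* high 32 bits */
}

/* Hypothetical self-check with two products that are easy to verify by hand. */
void extmul32_check(void) {
    uint32_t lo, hi;

    /* (2^32 - 1)^2 = 2^64 - 2^33 + 1, so hi = 2^32 - 2 and lo = 1 */
    extmul32(0xFFFFFFFFu, 0xFFFFFFFFu, &lo, &hi);
    assert(lo == 0x00000001u && hi == 0xFFFFFFFEu);

    /* 2^16 * 2^16 = 2^32, so hi = 1 and lo = 0 */
    extmul32(0x10000u, 0x10000u, &lo, &hi);
    assert(lo == 0u && hi == 1u);
}
```

Since uint64_t is standard C99, this version does not depend on any compiler extension.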