I'm looking for a fast method to efficiently compute (a⋅b) modulo n (in the mathematical sense of that) for
You could do it the old-fashioned way with shift/add/subtract. The code below assumes a < n and
n < 2^63 (so things don't overflow):
    uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
        uint64_t rv = 0;
        while (b) {
            if (b & 1)                      /* this bit of b is set: add the current a */
                if ((rv += a) >= n) rv -= n;
            if ((a += a) >= n) a -= n;      /* double a, keeping it below n */
            b >>= 1;
        }
        return rv;
    }
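If you can't guarantee a < n at the call site, one option is to reduce a once before entering the loop. Here is a minimal sketch of that; the names mulmod_shift and mulmod_any are made up for illustration, and n < 2^63 is still assumed:

```c
#include <stdint.h>

/* The shift/add/subtract loop; requires a < n and n < 2^63. */
static uint64_t mulmod_shift(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t rv = 0;
    while (b) {
        if (b & 1)
            if ((rv += a) >= n) rv -= n;
        if ((a += a) >= n) a -= n;
        b >>= 1;
    }
    return rv;
}

/* Hypothetical wrapper: one extra % so that a < n holds on entry.
   b needs no reduction, it only controls how many iterations run. */
static uint64_t mulmod_any(uint64_t a, uint64_t b, uint64_t n) {
    return mulmod_shift(a % n, b, n);
}
```

For example, mulmod_any(25, 3, 10) reduces 25 to 5 first and then computes 15 mod 10 = 5.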
You could use while (a && b) as the loop condition instead, to short-circuit once a reaches 0. That happens when some a⋅2^k is a multiple of n, so it only pays off if a is likely to share factors with n in that way. It will be slightly slower otherwise (more comparisons, though the extra branches will likely be predicted correctly).
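Spelled out, the early-exit variant might look like the sketch below. The name mulmod_early_exit is just for illustration, and the same a < n and n < 2^63 assumptions apply:

```c
#include <stdint.h>

/* Same loop, but stop as soon as a has been reduced to 0:
   every remaining term a*2^k is then 0 mod n and adds nothing. */
static uint64_t mulmod_early_exit(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t rv = 0;
    while (a && b) {
        if (b & 1)
            if ((rv += a) >= n) rv -= n;
        if ((a += a) >= n) a -= n;
        b >>= 1;
    }
    return rv;
}
```

With a = 4, b = 7, n = 8, the first iteration adds 4 to rv and doubles a to 8, which reduces to 0, so the loop ends after one pass with rv = 4 (28 mod 8).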
If you really, absolutely need that last bit (allowing n up to 2^64-1), you can use:
    uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
        uint64_t rv = 0;
        while (b) {
            if (b & 1) {
                rv += a;
                /* rv < a means the addition wrapped past 2^64 */
                if (rv < a || rv >= n) rv -= n;
            }
            uint64_t t = a;
            a += a;
            /* a < t means the doubling wrapped past 2^64 */
            if (a < t || a >= n) a -= n;
            b >>= 1;
        }
        return rv;
    }
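The wraparound handling is easy to get subtly wrong, so it may be worth cross-checking against a wide-integer reference. The sketch below assumes a compiler that provides the unsigned __int128 extension (GCC and Clang do on 64-bit targets); mulmod_full and mulmod_ref are illustrative names, not part of any library:

```c
#include <stdint.h>

/* Full-range shift/add version; requires a < n, n may be up to 2^64-1. */
static uint64_t mulmod_full(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t rv = 0;
    while (b) {
        if (b & 1) {
            rv += a;
            if (rv < a || rv >= n) rv -= n;   /* wrapped, or just too big */
        }
        uint64_t t = a;
        a += a;
        if (a < t || a >= n) a -= n;          /* wrapped, or just too big */
        b >>= 1;
    }
    return rv;
}

/* Reference: form the exact 128-bit product, then reduce.
   Relies on the non-standard unsigned __int128 type. */
static uint64_t mulmod_ref(uint64_t a, uint64_t b, uint64_t n) {
    return (uint64_t)(((unsigned __int128)a * b) % n);
}
```

A handy edge case is n = 2^64-1 with a = b = n-1: since n-1 ≡ -1 (mod n), both functions should return 1.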
Alternately, just use GCC inline assembly to access the underlying x86-64 instructions:
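    inline uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n) {
        uint64_t rv;
        /* mul: rdx:rax = rax * b, so rv:a now holds the 128-bit product */
        asm ("mul %3" : "=d"(rv), "=a"(a) : "1"(a), "r"(b));
        /* div: divides rdx:rax by n and leaves the remainder in rdx; it faults
           if the quotient does not fit in 64 bits, which a < n (or b < n) prevents */
        asm ("div %4" : "=d"(rv), "=a"(a) : "0"(rv), "1"(a), "r"(n));
        return rv;
    }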
The 64-bit div instruction is really slow, however, so the loop might actually be faster. You'd need to profile to be sure.
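For the profiling, a rough harness along these lines might be enough to get a first impression. Everything here is an illustrative sketch: time_mulmod and the candidate names are made up, clock() only gives coarse CPU time, and mulmod_wide (which relies on the GCC/Clang unsigned __int128 extension) is a stand-in candidate, so you would plug in the inline-assembly version yourself on x86-64:

```c
#include <stdint.h>
#include <time.h>

typedef uint64_t (*mulmod_fn)(uint64_t, uint64_t, uint64_t);

/* Candidate 1: the shift/add loop (requires a < n and n < 2^63). */
static uint64_t mulmod_loop(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t rv = 0;
    while (b) {
        if (b & 1)
            if ((rv += a) >= n) rv -= n;
        if ((a += a) >= n) a -= n;
        b >>= 1;
    }
    return rv;
}

/* Candidate 2 (stand-in): exact 128-bit product via a compiler extension. */
static uint64_t mulmod_wide(uint64_t a, uint64_t b, uint64_t n) {
    return (uint64_t)(((unsigned __int128)a * b) % n);
}

/* Chains iters calls so each depends on the previous result and the
   compiler cannot discard them. Returns elapsed CPU seconds and stores
   the final value in *sink. Assumes 2 < n < 2^63. */
static double time_mulmod(mulmod_fn fn, uint64_t n, uint64_t iters, uint64_t *sink) {
    uint64_t x = 2;
    clock_t start = clock();
    for (uint64_t i = 0; i < iters; i++) {
        x = fn(x, 0x9E3779B97F4A7C15ULL + i, n);   /* result < n, so a < n holds */
        if (x == 0) x = 1;                          /* keep the chain from collapsing */
    }
    clock_t end = clock();
    *sink = x;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Call time_mulmod once per candidate with the same n and a large iters (tens of millions, say), check that both sinks agree, and compare the returned times. Use moduli and operand sizes that resemble your real workload, since the loop's cost depends on the bit length of b.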