Fastest way to calculate a 128-bit integer modulo a 64-bit integer

谎友^ 2020-12-01 00:15

I have a 128-bit unsigned integer A and a 64-bit unsigned integer B. What's the fastest way to calculate A % B, that is, the (64-bit) remainder from dividing A by B?

13 Answers
  • 2020-12-01 00:41

    If 128-bit unsigned by 63-bit unsigned is good enough, then it can be done in a loop of at most 63 iterations.

    Consider this a proposed solution to MSN's overflow problem, limiting the overflow to 1 bit. We do so by splitting the problem in two: a modular multiplication for the upper half, and adding the two results at the end.

    In the following example upper corresponds to the most significant 64-bits, lower to the least significant 64-bits and div is the divisor.

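    #include <stdint.h>

    // Renamed from 128_mod: a C identifier cannot start with a digit.
    // Only valid for div < 2^63 (see the overflow caveat below).
    uint64_t mod128(uint64_t upper, uint64_t lower, uint64_t div) {
      uint64_t result = 0;
      uint64_t a = (UINT64_MAX % div) + 1; // 2^64 mod div (equals div when div divides 2^64)
      upper %= div; // the resulting bit-length determines number of cycles required

      // first we work out modular multiplication of (2^64*upper)%div
      while (upper != 0) {
        if ((upper & 1) == 1) { // parenthesised: == binds tighter than &
          result += a;
          if (result >= div) { result -= div; }
        }
        a <<= 1;
        if (a >= div) { a -= div; }
        upper >>= 1;
      }

      // add up the 2 results and return the modulus
      if (lower >= div) { lower -= div; }
      return (lower + result) % div;
    }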
    

    The only problem is that, if the divisor uses all 64 bits, we get 1-bit overflows (loss of information), giving a faulty result.

    It bugs me that I haven't figured out a neat way to handle the overflows.
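
    One possible way around the 1-bit overflow, offered as a sketch rather than as this answer's method: do every modular addition with a comparison against `div - y` instead of adding first and reducing afterwards, so that no intermediate value ever needs a 65th bit. The names `addmod64` and `mod128_full` are hypothetical, and the extra `%` operations make this slower than the loop above.

```c
#include <stdint.h>

/* (x + y) mod m, assuming x < m and y < m. Never forms x + y when it
   would not fit in 64 bits. */
static uint64_t addmod64(uint64_t x, uint64_t y, uint64_t m) {
    /* m - y cannot underflow because y < m */
    return (x >= m - y) ? x - (m - y) : x + y;
}

/* (upper * 2^64 + lower) mod div for any non-zero 64-bit div. */
uint64_t mod128_full(uint64_t upper, uint64_t lower, uint64_t div) {
    uint64_t result = 0;
    uint64_t a = (UINT64_MAX % div + 1) % div;   /* 2^64 mod div, fully reduced */
    upper %= div;
    while (upper != 0) {
        if (upper & 1) {
            result = addmod64(result, a, div);   /* accumulate a * (current bit) */
        }
        a = addmod64(a, a, div);                 /* modular doubling */
        upper >>= 1;
    }
    return addmod64(result, lower % div, div);
}
```

    The loop still runs once per significant bit of `upper % div`, so the worst case grows to 64 iterations when the divisor is a full 64 bits wide.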

  • 2020-12-01 00:43

    This is an almost untested, partly speed-optimised Mod128by64 function based on the 'Russian peasant' algorithm. Unfortunately I'm a Delphi user, so this function works under Delphi. :) But the assembler is almost the same, so...

    function Mod128by64(Dividend: PUInt128; Divisor: PUInt64): UInt64;
    //In : eax = @Dividend
    //   : edx = @Divisor
    //Out: eax:edx as Remainder
    asm
    //Registers inside routine
    //Divisor = edx:ebp
    //Dividend = bh:ebx:edx //We need 64 bits + 1 bit in bh
    //Result = esi:edi
    //ecx = Loop counter and Dividend index
      push    ebx                     //Store registers to stack
      push    esi
      push    edi
      push    ebp
      mov     ebp, [edx]              //Divisor = edx:ebp
      mov     edx, [edx + 4]
      mov     ecx, ebp                //Div by 0 test
      or      ecx, edx                
      jz      @DivByZero
      xor     edi, edi                //Clear result
      xor     esi, esi
    //Start of 64 bit division Loop
      mov     ecx, 15                 //Load byte loop shift counter and Dividend index
    @SkipShift8Bits:                  //Small Dividend numbers shift optimisation
      cmp     [eax + ecx], ch         //Zero test
      jnz     @EndSkipShiftDividend
      loop    @SkipShift8Bits         //Skip 8 bit loop
    @EndSkipShiftDividend:
      test    edx, $FF000000          //Huge Divisor Numbers Shift Optimisation
      jz      @Shift8Bits             //This Divisor is > $00FFFFFF:FFFFFFFF
      mov     ecx, 8                  //Load byte shift counter
      mov     esi, [eax + 12]         //Do fast 56 bit (7 bytes) shift...
      shr     esi, cl                 //esi = $00XXXXXX
      mov     edi, [eax + 9]          //Load for one byte right shifted 32 bit value
    @Shift8Bits:
      mov     bl, [eax + ecx]         //Load 8 bits of Dividend
    //Here we can unroll the partial 8 bit division loop to increase execution speed...
      mov     ch, 8                   //Set partial byte counter value
    @Do65BitsShift:
      shl     bl, 1                   //Shift dividend left for one bit
      rcl     edi, 1
      rcl     esi, 1
      setc    bh                      //Save 65th bit
      sub     edi, ebp                //Compare dividend and  divisor
      sbb     esi, edx                //Subtract the divisor
      sbb     bh, 0                   //Use 65th bit in bh
      jnc     @NoCarryAtCmp           //Test...
      add     edi, ebp                //Return previous dividend state
      adc     esi, edx
    @NoCarryAtCmp:
      dec     ch                      //Decrement counter
      jnz     @Do65BitsShift
    //End of 8 bit (byte) partial division loop
      dec     cl                      //Decrement byte loop shift counter
      jns     @Shift8Bits             //Last jump at cl = 0!!!
    //End of 64 bit division loop
      mov     eax, edi                //Load result to eax:edx
      mov     edx, esi
    @RestoreRegisters:
      pop     ebp                     //Restore Registers
      pop     edi
      pop     esi
      pop     ebx
      ret
    @DivByZero:
      xor     eax, eax                //Here you can raise Div by 0 exception, now function only return 0.
      xor     edx, edx
      jmp     @RestoreRegisters
    end;
    

    At least one more speed optimisation is possible! After the 'Huge Divisor Numbers Shift Optimisation' we can test the divisor's high bit; if it is 0, we do not need the extra bh register to hold the 65th bit. The unrolled part of the loop can then look like:

      shl     bl,1                    //Shift dividend left for one bit
      rcl     edi,1
      rcl     esi,1
      sub     edi, ebp                //Compare dividend and  divisor
      sbb     esi, edx                //Subtract the divisor
      jnc     @NoCarryAtCmpX
      add     edi, ebp                //Return previous dividend state
      adc     esi, edx
    @NoCarryAtCmpX:
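
    For readers who do not use Delphi, the core shift-and-subtract loop with its "65th bit" can be sketched in portable C. This is an illustration of the same idea, not a translation of the routine above: it processes one bit per iteration and omits the byte-skipping optimisations. The function name is hypothetical.

```c
#include <stdint.h>

/* Remainder of the 128-bit value hi:lo divided by b, using bitwise
   shift-and-subtract. b must be non-zero. */
uint64_t mod128_shift_subtract(uint64_t hi, uint64_t lo, uint64_t b) {
    uint64_t rem = 0;
    for (int i = 127; i >= 0; i--) {
        uint64_t word = (i >= 64) ? hi : lo;
        uint64_t bit = (word >> (i & 63)) & 1;   /* next dividend bit, MSB first */
        uint64_t carry = rem >> 63;              /* the 65th bit shifted out of rem */
        rem = (rem << 1) | bit;
        /* With carry set, the true 65-bit value exceeds any 64-bit b, so we
           must subtract; unsigned wrap-around yields the right 64-bit result. */
        if (carry || rem >= b) {
            rem -= b;
        }
    }
    return rem;
}
```

    When the divisor's top bit is clear, `carry` is always 0, which corresponds to the cheaper unrolled variant shown above.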
    
  • 2020-12-01 00:44

    I know the question specified 32-bit code, but the answer for 64-bit may be useful or interesting to others.

    And yes, 64b/32b => 32b division does make a useful building-block for 128b % 64b => 64b. libgcc's __umoddi3 (source linked below) gives an idea of how to do that sort of thing, but it only implements 2N % 2N => 2N on top of a 2N / N => N division, not 4N % 2N => 2N.

    Wider multi-precision libraries are available, e.g. https://gmplib.org/manual/Integer-Division.html#Integer-Division.


    GNU C on 64-bit machines does provide an __int128 type, and libgcc functions to multiply and divide as efficiently as possible on the target architecture.

    x86-64's div r/m64 instruction does 128b/64b => 64b division (also producing remainder as a second output), but it faults if the quotient overflows. So you can't directly use it if A/B > 2^64-1, but you can get gcc to use it for you (or even inline the same code that libgcc uses).


    The function below compiles (as seen on the Godbolt compiler explorer) to one or two div instructions, which execute inside a libgcc function call. If there were a faster way, libgcc would probably use it instead.

    #include <stdint.h>
    uint64_t AmodB(unsigned __int128 A, uint64_t B) {
      return A % B;
    }
    

    The __umodti3 function it calls calculates a full 128b/128b modulo, but the implementation of that function does check for the special case where the divisor's high half is 0, as you can see in the libgcc source. (libgcc builds the si/di/ti version of the function from that code, as appropriate for the target architecture. udiv_qrnnd is an inline asm macro that does unsigned 2N/N => N division for the target architecture.)

    For x86-64 (and other architectures with a hardware divide instruction), the fast-path (when high_half(A) < B; guaranteeing div won't fault) is just two not-taken branches, some fluff for out-of-order CPUs to chew through, and a single div r64 instruction, which takes about 50-100 cycles (see footnote 1) on modern x86 CPUs, according to Agner Fog's insn tables. Some other work can be happening in parallel with div, but the integer divide unit is not very pipelined and div decodes to a lot of uops (unlike FP division).

    The fallback path still only uses two 64-bit div instructions for the case where B is only 64-bit, but A/B doesn't fit in 64 bits so A/B directly would fault.
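
    The two-division idea can be sketched in GNU C as follows. This is an illustration of the reduction, not libgcc's actual code: the first `%` reduces the high half so that it is smaller than B, which guarantees the quotient of the second division fits in 64 bits. The function name is hypothetical, and whether the compiler emits a bare `div` or a libgcc call for the second step is up to the compiler.

```c
#include <stdint.h>

/* Remainder of the 128-bit value hi:lo divided by b (b must be non-zero).
   Requires the GNU C unsigned __int128 extension. */
uint64_t mod128_two_step(uint64_t hi, uint64_t lo, uint64_t b) {
    /* Step 1: hi * 2^64 is congruent to (hi % b) * 2^64 modulo b. */
    uint64_t hi_rem = hi % b;
    /* Step 2: hi_rem < b, so (hi_rem:lo) / b is below 2^64 and a
       128b/64b => 64b divide on this value cannot overflow. */
    unsigned __int128 t = ((unsigned __int128)hi_rem << 64) | lo;
    return (uint64_t)(t % b);
}
```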

    Note that libgcc's __umodti3 just inlines __udivmoddi4 into a wrapper that only returns the remainder.

    Footnote 1: 32-bit div is over 2x faster on Intel CPUs. On AMD CPUs, performance only depends on the size of the actual input values, even if they're small values in a 64-bit register. If small values are common, it might be worth benchmarking a branch to a simple 32-bit division version before doing 64-bit or 128-bit division.
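
    The branch suggested in the footnote might look like the sketch below for plain 64-bit operands. The function name is hypothetical, and whether the extra branch pays off depends on the CPU and on how often both values are small, so it needs benchmarking.

```c
#include <stdint.h>

/* a % b with a fast path for operands that both fit in 32 bits.
   b must be non-zero. */
uint64_t mod64_small_fast_path(uint64_t a, uint64_t b) {
    if (((a | b) >> 32) == 0) {
        /* Both values fit in 32 bits: use the narrower division. */
        return (uint32_t)a % (uint32_t)b;
    }
    return a % b;   /* general 64-bit path */
}
```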


    For repeated modulo by the same B

    It might be worth considering calculating a fixed-point multiplicative inverse for B, if one exists. For example, with compile-time constants, gcc does the optimization for types narrower than 128b.

    uint64_t modulo_by_constant64(uint64_t A) { return A % 0x12345678ABULL; }
    
        movabs  rdx, -2233785418547900415
        mov     rax, rdi
        mul     rdx
        mov     rax, rdx             # wasted instruction, could have kept using RDX.
        movabs  rdx, 78187493547
        shr     rax, 36            # division result
        imul    rax, rdx           # multiply and subtract to get the modulo
        sub     rdi, rax
        mov     rax, rdi
        ret
    

    x86's mul r64 instruction does 64b*64b => 128b (rdx:rax) multiplication, and can be used as a building block to construct a 128b * 128b => 256b multiply to implement the same algorithm. Since we only need the high half of the full 256b result, that saves a few multiplies.

    Modern Intel CPUs have very high performance mul: 3c latency, one per clock throughput. However, the exact combination of shifts and adds required varies with the constant, so the general case of calculating a multiplicative inverse at run-time isn't quite as efficient each time it's used as a JIT-compiled or statically-compiled version (even on top of the pre-computation overhead).

    IDK where the break-even point would be. For JIT-compiling, it will be higher than ~200 reuses, unless you cache generated code for commonly-used B values. For the "normal" way, it might possibly be in the range of 200 reuses, but IDK how expensive it would be to find a modular multiplicative inverse for 128-bit / 64-bit division.

    libdivide can do this for you, but only for 32 and 64-bit types. Still, it's probably a good starting point.
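
    To make the precomputed-inverse idea concrete, here is a minimal sketch for a 64-bit A only, not the 128-bit case asked about. It uses a Barrett-style reduction, which is a technique swapped in for illustration and not what gcc or libdivide emit: one division up front, then each modulo costs a 64b*64b => 128b multiply (high half only), a multiply, and a conditional subtract. The type and function names are hypothetical, and the GNU C unsigned __int128 extension is assumed.

```c
#include <stdint.h>

/* Hypothetical helper type: divisor b plus m = floor((2^64 - 1) / b). */
typedef struct { uint64_t b, m; } barrett64;

/* One-time setup; b must be non-zero. */
barrett64 barrett64_init(uint64_t b) {
    barrett64 ctx = { b, UINT64_MAX / b };
    return ctx;
}

/* a % ctx.b without a divide instruction. */
uint64_t barrett64_mod(uint64_t a, barrett64 ctx) {
    /* High 64 bits of a * m: an estimate of a / b that is either exact
       or one too small. */
    uint64_t q = (uint64_t)(((unsigned __int128)a * ctx.m) >> 64);
    uint64_t r = a - q * ctx.b;        /* r is in [0, 2b) and never exceeds a */
    if (r >= ctx.b) {
        r -= ctx.b;                    /* correct the off-by-one estimate */
    }
    return r;
}
```

    Extending this to a 128-bit A would need the wider multiply described above, built from several `mul r64` steps.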

  • 2020-12-01 00:44

    The accepted answer by @caf was really nice and highly rated, yet it contained a bug that went unnoticed for years.

    To help test that and other solutions, I am posting a test harness and making it community wiki.

    #include <assert.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    unsigned cafMod(unsigned A, unsigned B) {
      assert(B);
      unsigned X = B;
      // while (X < A / 2) {  Original code used <
      while (X <= A / 2) {
        X <<= 1;
      }
      while (A >= B) {
        if (A >= X) A -= X;
        X >>= 1;
      }
      return A;
    }
    
    void cafMod_test(unsigned num, unsigned den) {
      if (den == 0) return;
      unsigned y0 = num % den;
      unsigned y1 = cafMod(num, den);
      if (y0 != y1) {
        printf("FAIL num:%x den:%x %x %x\n", num, den, y0, y1);
        fflush(stdout);
        exit(-1);
      }
    }
    
    unsigned rand_unsigned() {
      unsigned x = (unsigned) rand();
      return x * 2 ^ (unsigned) rand();
    }
    
    void cafMod_tests(void) {
      const unsigned i[] = { 0, 1, 2, 3, 0x7FFFFFFF, 0x80000000, 
          UINT_MAX - 3, UINT_MAX - 2, UINT_MAX - 1, UINT_MAX };
      for (unsigned den = 0; den < sizeof i / sizeof i[0]; den++) {
        if (i[den] == 0) continue;
        for (unsigned num = 0; num < sizeof i / sizeof i[0]; num++) {
          cafMod_test(i[num], i[den]);
        }
      }
      cafMod_test(0x8711dd11, 0x4388ee88);
      cafMod_test(0xf64835a1, 0xf64835a);
    
      time_t t;
      time(&t);
      srand((unsigned) t);
      printf("%u\n", (unsigned) t);fflush(stdout);
      for (long long n = 10000LL * 1000LL * 1000LL; n > 0; n--) {
        cafMod_test(rand_unsigned(), rand_unsigned());
      }
    
      puts("Done");
    }
    
    int main(void) {
      cafMod_tests();
      return 0;
    }
    
  • 2020-12-01 00:45

    If you have a recent x86 machine, there are 128-bit registers for SSE2+. I've never tried to write assembly for anything other than basic x86, but I suspect there are some guides out there.

  • 2020-12-01 00:47

    You can use the division version of Russian Peasant Multiplication.

    To find the remainder, execute (in pseudo-code):

    X = B;
    
    while (X <= A/2)
    {
        X <<= 1;
    }
    
    while (A >= B)
    {
        if (A >= X)
            A -= X;
        X >>= 1;
    }
    

    The modulus is left in A.

    You'll need to implement the shifts, comparisons and subtractions to operate on values made up of a pair of 64 bit numbers, but that's fairly trivial (likely you should implement the left-shift-by-1 as X + X).

    This will loop at most 255 times (with a 128 bit A). Of course you need to do a pre-check for a zero divisor.
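
    A minimal C sketch of the pseudo-code above, with the 128-bit values held as pairs of 64-bit numbers. The `u128` type and its helper functions are hypothetical names introduced for illustration; the left shift is a true shift here rather than X + X.

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } u128;   /* hypothetical two-word integer */

static int u128_ge(u128 a, u128 b) {        /* a >= b */
    return a.hi > b.hi || (a.hi == b.hi && a.lo >= b.lo);
}
static u128 u128_sub(u128 a, u128 b) {      /* a - b, assuming a >= b */
    u128 r = { a.hi - b.hi - (a.lo < b.lo), a.lo - b.lo };
    return r;
}
static u128 u128_shl1(u128 a) {             /* a << 1 */
    u128 r = { (a.hi << 1) | (a.lo >> 63), a.lo << 1 };
    return r;
}
static u128 u128_shr1(u128 a) {             /* a >> 1 */
    u128 r = { a.hi >> 1, (a.lo >> 1) | (a.hi << 63) };
    return r;
}

/* Remainder of hi:lo divided by b; the caller must ensure b != 0. */
uint64_t peasant_mod(uint64_t hi, uint64_t lo, uint64_t b) {
    u128 a = { hi, lo };
    u128 bb = { 0, b };
    u128 x = bb;
    u128 half = u128_shr1(a);
    while (u128_ge(half, x)) {      /* while (X <= A/2) */
        x = u128_shl1(x);           /* cannot overflow: 2X <= A here */
    }
    while (u128_ge(a, bb)) {        /* while (A >= B) */
        if (u128_ge(a, x)) {
            a = u128_sub(a, x);
        }
        x = u128_shr1(x);
    }
    return a.lo;                    /* A < B now, so the high word is 0 */
}
```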
