Fastest way to calculate a 128-bit integer modulo a 64-bit integer

后端 未结 13 1802
谎友^
谎友^ 2020-12-01 00:15

I have a 128-bit unsigned integer A and a 64-bit unsigned integer B. What\'s the fastest way to calculate A % B - that is the (64-bit) remainder from dividing A

相关标签:
13条回答
  • 2020-12-01 00:24

    Since there is no predefined 128-bit integer type in C, bits of A have to be represented in an array. Although B (64-bit integer) can be stored in an unsigned long long int variable, it is needed to put bits of B into another array in order to work on A and B efficiently.

    After that, B is incremented as Bx2, Bx3, Bx4, ... until it is the greatest B less than A. And then (A-B) can be calculated, using some subtraction knowledge for base 2.

    Is this the kind of solution that you are looking for?

    0 讨论(0)
  • 2020-12-01 00:27

    Perhaps you're looking for a finished program, but the basic algorithms for multi-precision arithmetic can be found in Knuth's Art of Computer Programming, Volume 2. You can find the division algorithm described online here. The algorithms deal with arbitrary multi-precision arithmetic, and so are more general than you need, but you should be able to simplify them for 128 bit arithmetic done on 64- or 32-bit digits. Be prepared for a reasonable amount of work (a) understanding the algorithm, and (b) converting it to C or assembler.

    You might also want to check out Hacker's Delight, which is full of very clever assembler and other low-level hackery, including some multi-precision arithmetic.

    0 讨论(0)
  • 2020-12-01 00:27

    I have made both version of Mod128by64 'Russian peasant' division function: classic and speed optimised. Speed optimised can do on my 3Ghz PC more than 1000.000 random calculations per second and is more than three times faster than classic function. If we compare the execution time of calculating 128 by 64 and calculating 64 by 64 bit modulo than this function is only about 50% slower.

    Classic Russian peasant:

    function Mod128by64Clasic(Dividend: PUInt128; Divisor: PUInt64): UInt64;
    //In : eax = @Dividend
    //   : edx = @Divisor
    //Out: eax:edx as Remainder
    asm
    //Registers inside rutine
    //edx:ebp = Divisor
    //ecx = Loop counter
    //Result = esi:edi
      push    ebx                     //Store registers to stack
      push    esi
      push    edi
      push    ebp
      mov     ebp, [edx]              //Load  divisor to edx:ebp
      mov     edx, [edx + 4]
      mov     ecx, ebp                //Div by 0 test
      or      ecx, edx
      jz      @DivByZero
      push    [eax]                   //Store Divisor to the stack
      push    [eax + 4]
      push    [eax + 8]
      push    [eax + 12]
      xor     edi, edi                //Clear result
      xor     esi, esi
      mov     ecx, 128                //Load shift counter
    @Do128BitsShift:
      shl     [esp + 12], 1           //Shift dividend from stack left for one bit
      rcl     [esp + 8], 1
      rcl     [esp + 4], 1
      rcl     [esp], 1
      rcl     edi, 1
      rcl     esi, 1
      setc    bh                      //Save 65th bit
      sub     edi, ebp                //Compare dividend and  divisor
      sbb     esi, edx                //Subtract the divisor
      sbb     bh, 0                   //Use 65th bit in bh
      jnc     @NoCarryAtCmp           //Test...
      add     edi, ebp                //Return privius dividend state
      adc     esi, edx
    @NoCarryAtCmp:
      loop    @Do128BitsShift
    //End of 128 bit division loop
      mov     eax, edi                //Load result to eax:edx
      mov     edx, esi
    @RestoreRegisters:
      lea     esp, esp + 16           //Restore Divisors space on stack
      pop     ebp                     //Restore Registers
      pop     edi                     
      pop     esi
      pop     ebx
      ret
    @DivByZero:
      xor     eax, eax                //Here you can raise Div by 0 exception, now function only return 0.
      xor     edx, edx
      jmp     @RestoreRegisters
    end;
    

    Speed optimised Russian peasant:

    function Mod128by64Oprimized(Dividend: PUInt128; Divisor: PUInt64): UInt64;
    //In : eax = @Dividend
    //   : edx = @Divisor
    //Out: eax:edx as Remainder
    asm
    //Registers inside rutine
    //Divisor = edx:ebp
    //Dividend = ebx:edx //We need 64 bits
    //Result = esi:edi
    //ecx = Loop counter and Dividend index
      push    ebx                     //Store registers to stack
      push    esi
      push    edi
      push    ebp
      mov     ebp, [edx]              //Divisor = edx:ebp
      mov     edx, [edx + 4]
      mov     ecx, ebp                //Div by 0 test
      or      ecx, edx
      jz      @DivByZero
      xor     edi, edi                //Clear result
      xor     esi, esi
    //Start of 64 bit division Loop
      mov     ecx, 15                 //Load byte loop shift counter and Dividend index
    @SkipShift8Bits:                  //Small Dividend numbers shift optimisation
      cmp     [eax + ecx], ch         //Zero test
      jnz     @EndSkipShiftDividend
      loop    @SkipShift8Bits         //Skip Compute 8 Bits unroled loop ?
    @EndSkipShiftDividend:
      test    edx, $FF000000          //Huge Divisor Numbers Shift Optimisation
      jz      @Shift8Bits             //This Divisor is > $00FFFFFF:FFFFFFFF
      mov     ecx, 8                  //Load byte shift counter
      mov     esi, [eax + 12]         //Do fast 56 bit (7 bytes) shift...
      shr     esi, cl                 //esi = $00XXXXXX
      mov     edi, [eax + 9]          //Load for one byte right shifted 32 bit value
    @Shift8Bits:
      mov     bl, [eax + ecx]         //Load 8 bit part of Dividend
    //Compute 8 Bits unroled loop
      shl     bl, 1                   //Shift dividend left for one bit
      rcl     edi, 1
      rcl     esi, 1
      jc      @DividentAbove0         //dividend hi bit set?
      cmp     esi, edx                //dividend hi part larger?
      jb      @DividentBelow0
      ja      @DividentAbove0
      cmp     edi, ebp                //dividend lo part larger?
      jb      @DividentBelow0
    @DividentAbove0:
      sub     edi, ebp                //Return privius dividend state
      sbb     esi, edx
    @DividentBelow0:
      shl     bl, 1                   //Shift dividend left for one bit
      rcl     edi, 1
      rcl     esi, 1
      jc      @DividentAbove1         //dividend hi bit set?
      cmp     esi, edx                //dividend hi part larger?
      jb      @DividentBelow1
      ja      @DividentAbove1
      cmp     edi, ebp                //dividend lo part larger?
      jb      @DividentBelow1
    @DividentAbove1:
      sub     edi, ebp                //Return privius dividend state
      sbb     esi, edx
    @DividentBelow1:
      shl     bl, 1                   //Shift dividend left for one bit
      rcl     edi, 1
      rcl     esi, 1
      jc      @DividentAbove2         //dividend hi bit set?
      cmp     esi, edx                //dividend hi part larger?
      jb      @DividentBelow2
      ja      @DividentAbove2
      cmp     edi, ebp                //dividend lo part larger?
      jb      @DividentBelow2
    @DividentAbove2:
      sub     edi, ebp                //Return privius dividend state
      sbb     esi, edx
    @DividentBelow2:
      shl     bl, 1                   //Shift dividend left for one bit
      rcl     edi, 1
      rcl     esi, 1
      jc      @DividentAbove3         //dividend hi bit set?
      cmp     esi, edx                //dividend hi part larger?
      jb      @DividentBelow3
      ja      @DividentAbove3
      cmp     edi, ebp                //dividend lo part larger?
      jb      @DividentBelow3
    @DividentAbove3:
      sub     edi, ebp                //Return privius dividend state
      sbb     esi, edx
    @DividentBelow3:
      shl     bl, 1                   //Shift dividend left for one bit
      rcl     edi, 1
      rcl     esi, 1
      jc      @DividentAbove4         //dividend hi bit set?
      cmp     esi, edx                //dividend hi part larger?
      jb      @DividentBelow4
      ja      @DividentAbove4
      cmp     edi, ebp                //dividend lo part larger?
      jb      @DividentBelow4
    @DividentAbove4:
      sub     edi, ebp                //Return privius dividend state
      sbb     esi, edx
    @DividentBelow4:
      shl     bl, 1                   //Shift dividend left for one bit
      rcl     edi, 1
      rcl     esi, 1
      jc      @DividentAbove5         //dividend hi bit set?
      cmp     esi, edx                //dividend hi part larger?
      jb      @DividentBelow5
      ja      @DividentAbove5
      cmp     edi, ebp                //dividend lo part larger?
      jb      @DividentBelow5
    @DividentAbove5:
      sub     edi, ebp                //Return privius dividend state
      sbb     esi, edx
    @DividentBelow5:
      shl     bl, 1                   //Shift dividend left for one bit
      rcl     edi, 1
      rcl     esi, 1
      jc      @DividentAbove6         //dividend hi bit set?
      cmp     esi, edx                //dividend hi part larger?
      jb      @DividentBelow6
      ja      @DividentAbove6
      cmp     edi, ebp                //dividend lo part larger?
      jb      @DividentBelow6
    @DividentAbove6:
      sub     edi, ebp                //Return privius dividend state
      sbb     esi, edx
    @DividentBelow6:
      shl     bl, 1                   //Shift dividend left for one bit
      rcl     edi, 1
      rcl     esi, 1
      jc      @DividentAbove7         //dividend hi bit set?
      cmp     esi, edx                //dividend hi part larger?
      jb      @DividentBelow7
      ja      @DividentAbove7
      cmp     edi, ebp                //dividend lo part larger?
      jb      @DividentBelow7
    @DividentAbove7:
      sub     edi, ebp                //Return privius dividend state
      sbb     esi, edx
    @DividentBelow7:
    //End of Compute 8 Bits (unroled loop)
      dec     cl                      //Decrement byte loop shift counter
      jns     @Shift8Bits             //Last jump at cl = 0!!!
    //End of division loop
      mov     eax, edi                //Load result to eax:edx
      mov     edx, esi
    @RestoreRegisters:
      pop     ebp                     //Restore Registers
      pop     edi
      pop     esi
      pop     ebx
      ret
    @DivByZero:
      xor     eax, eax                //Here you can raise Div by 0 exception, now function only return 0.
      xor     edx, edx
      jmp     @RestoreRegisters
    end;
    
    0 讨论(0)
  • 2020-12-01 00:28

    Given A = AH*2^64 + AL:

    A % B == (((AH % B) * (2^64 % B)) + (AL % B)) % B
          == (((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B
    

    If your compiler supports 64-bit integers, then this is probably the easiest way to go. MSVC's implementation of a 64-bit modulo on 32-bit x86 is some hairy loop filled assembly (VC\crt\src\intel\llrem.asm for the brave), so I'd personally go with that.

    0 讨论(0)
  • 2020-12-01 00:31

    As a general rule, division is slow and multiplication is faster, and bit shifting is faster yet. From what I have seen of the answers so far, most of the answers have been using a brute force approach using bit-shifts. There exists another way. Whether it is faster remains to be seen (AKA profile it).

    Instead of dividing, multiply by the reciprocal. Thus, to discover A % B, first calculate the reciprocal of B ... 1/B. This can be done with a few loops using the Newton-Raphson method of convergence. To do this well will depend upon a good set of initial values in a table.

    For more details on the Newton-Raphson method of converging on the reciprocal, please refer to http://en.wikipedia.org/wiki/Division_(digital)

    Once you have the reciprocal, the quotient Q = A * 1/B.

    The remainder R = A - Q*B.

    To determine if this would be faster than the brute force (as there will be many more multiplies since we will be using 32-bit registers to simulate 64-bit and 128-bit numbers, profile it.

    If B is constant in your code, you can pre-calculate the reciprocal and simply calculate using the last two formulae. This, I am sure will be faster than bit-shifting.

    Hope this helps.

    0 讨论(0)
  • 2020-12-01 00:36

    I'd like to share a few thoughts.

    It's not as simple as MSN proposes I'm afraid.

    In the expression:

    (((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B
    

    both multiplication and addition may overflow. I think one could take it into account and still use the general concept with some modifications, but something tells me it's going to get really scary.

    I was curious how 64 bit modulo operation was implemented in MSVC and I tried to find something out. I don't really know assembly and all I had available was Express edition, without the source of VC\crt\src\intel\llrem.asm, but I think I managed to get some idea what's going on, after a bit of playing with the debugger and disassembly output. I tried to figure out how the remainder is calculated in case of positive integers and the divisor >=2^32. There is some code that deals with negative numbers of course, but I didn't dig into that.

    Here is how I see it:

    If divisor >= 2^32 both the dividend and the divisor are shifted right as much as necessary to fit the divisor into 32 bits. In other words: if it takes n digits to write the divisor down in binary and n > 32, n-32 least significant digits of both the divisor and the dividend are discarded. After that, the division is performed using hardware support for dividing 64 bit integers by 32 bit ones. The result might be incorrect, but I think it can be proved, that the result may be off by at most 1. After the division, the divisor (original one) is multiplied by the result and the product subtracted from the dividend. Then it is corrected by adding or subtracting the divisor if necessary (if the result of the division was off by one).

    It's easy to divide 128 bit integer by 32 bit one leveraging hardware support for 64-bit by 32-bit division. In case the divisor < 2^32, one can calculate the remainder performing just 4 divisions as follows:

    Let's assume the dividend is stored in:

    DWORD dividend[4] = ...
    

    the remainder will go into:

    DWORD remainder;
    
    1) Divide dividend[3] by divisor. Store the remainder in remainder.
    2) Divide QWORD (remainder:dividend[2]) by divisor. Store the remainder in remainder.
    3) Divide QWORD (remainder:dividend[1]) by divisor. Store the remainder in remainder.
    4) Divide QWORD (remainder:dividend[0]) by divisor. Store the remainder in remainder.
    

    After those 4 steps the variable remainder will hold what You are looking for. (Please don't kill me if I got the endianess wrong. I'm not even a programmer)

    In case the divisor is grater than 2^32-1 I don't have good news. I don't have a complete proof that the result after the shift is off by no more than 1, in the procedure I described earlier, which I believe MSVC is using. I think however that it has something to do with the fact, that the part that is discarded is at least 2^31 times less than the divisor, the dividend is less than 2^64 and the divisor is greater than 2^32-1, so the result is less than 2^32.

    If the dividend has 128 bits the trick with discarding bits won't work. So in general case the best solution is probably the one proposed by GJ or caf. (Well, it would be probably the best even if discarding bits worked. Division, multiplication subtraction and correction on 128 bit integer might be slower.)

    I was also thinking about using the floating point hardware. x87 floating point unit uses 80 bit precision format with fraction 64 bits long. I think one can get the exact result of 64 bit by 64 bit division. (Not the remainder directly, but also the remainder using multiplication and subtraction like in the "MSVC procedure"). IF the dividend >=2^64 and < 2^128 storing it in the floating point format seems similar to discarding least significant bits in "MSVC procedure". Maybe someone can prove the error in that case is bound and find it useful. I have no idea if it has a chance to be faster than GJ's solution, but maybe it's worth it to try.

    0 讨论(0)
提交回复
热议问题