Getting the high part of 64 bit integer multiplication

無奈伤痛 2020-11-28 13:30

In C++, say that:

uint64_t i;
uint64_t j;

then i * j will yield a uint64_t that has as its value the lower part of the multiplication between i and j, i.e. (i * j) mod 2^64. How can I get the high part of the multiplication?

5 Answers
  •  無奈伤痛
    2020-11-28 14:15

    TL:DR with GCC for a 64-bit ISA: (a * (unsigned __int128)b) >> 64 compiles nicely, to a single full-multiply or high-half multiply instruction. No need to mess around with inline asm.


    Unfortunately current compilers don't optimize @craigster0's nice portable version, so if you want to take advantage of 64-bit CPUs, you can't use it except as a fallback for targets you don't have an #ifdef for. (I don't see a generic way to optimize it; you need a 128-bit type or an intrinsic.)


    GNU C (gcc, clang, or ICC) has unsigned __int128 on most 64-bit platforms. (Or in older versions, __uint128_t). GCC doesn't implement this type on 32-bit platforms, though.

    This is an easy and efficient way to get the compiler to emit a 64-bit full-multiply instruction and keep the high half. (GCC knows that a uint64_t cast to a 128-bit integer still has the upper half all zero, so you don't get a 128-bit multiply using three 64-bit multiplies.)

    MSVC also has a __umulh intrinsic for 64-bit high-half multiplication, but again it's only available on 64-bit platforms (specifically x86-64 and AArch64). The docs also mention _umul128 being available for IPF (IA-64), but I don't have MSVC for Itanium available to check. (Probably not relevant anyway.)

    #define HAVE_FAST_mul64 1
    
    #ifdef __SIZEOF_INT128__     // GNU C
     static inline
     uint64_t mulhi64(uint64_t a, uint64_t b) {
         unsigned __int128 prod =  a * (unsigned __int128)b;
         return prod >> 64;
     }
    
    #elif defined(_M_X64) || defined(_M_ARM64)     // MSVC
       // MSVC for x86-64 or AArch64
       // possibly also  || defined(_M_IA64) || defined(_WIN64)
       // but the docs only guarantee x86-64!  Don't use *just* _WIN64; it doesn't include AArch64 Android / Linux
    
      // https://docs.microsoft.com/en-gb/cpp/intrinsics/umulh
  #include <intrin.h>
      #define mulhi64 __umulh
    
    #elif defined(_M_IA64) // || defined(_M_ARM)       // MSVC again
      // https://docs.microsoft.com/en-gb/cpp/intrinsics/umul128
      // The docs there incorrectly say that _umul128 is available for ARM,
      // which would be weird because there's no single insn on AArch32
      #include <intrin.h>
      static inline
      uint64_t mulhi64(uint64_t a, uint64_t b) {
         unsigned __int64 HighProduct;
         (void)_umul128(a, b, &HighProduct);
         return HighProduct;
      }
    
    #else
    
    # undef HAVE_FAST_mul64
      uint64_t mulhi64(uint64_t a, uint64_t b);  // non-inline prototype
      // or you might want to define @craigster0's version here so it can inline.
    #endif
    

    For x86-64, AArch64, and PowerPC64 (and others), this compiles to one mul instruction, and a couple movs to deal with the calling convention (which should optimize away after this inlines). From the Godbolt compiler explorer (with source + asm for x86-64, PowerPC64, and AArch64):

         # x86-64 gcc7.3.  clang and ICC are the same.  (x86-64 System V calling convention)
         # MSVC makes basically the same function, but with different regs for x64 __fastcall
        mov     rax, rsi
        mul     rdi              # RDX:RAX = RAX * RDI
        mov     rax, rdx
        ret
    

    (or with clang -march=haswell to enable BMI2: mov rdx, rsi / mulx rax, rcx, rdi to put the high-half in RAX directly. gcc is dumb and still uses an extra mov.)

    For AArch64 (with gcc unsigned __int128 or MSVC with __umulh):

    test_var:
        umulh   x0, x0, x1
        ret
    

    With a compile-time constant power of 2 multiplier, we usually get the expected right-shift to grab a few high bits. But gcc amusingly uses shld (see the Godbolt link).


    Unfortunately current compilers don't optimize @craigster0's nice portable version. You get 8x shr r64,32, 4x imul r64,r64, and a bunch of add/mov instructions for x86-64. i.e. it compiles to a lot of 32x32 => 64-bit multiplies and unpacks of the results. So if you want something that takes advantage of 64-bit CPUs, you need some #ifdefs.

    A full-multiply mul 64 instruction is 2 uops on Intel CPUs, but still only 3 cycle latency, same as imul r64,r64 which only produces a 64-bit result. So the __int128 / intrinsic version is 5 to 10 times cheaper in latency and throughput (impact on surrounding code) on modern x86-64 than the portable version, from a quick eyeball guess based on http://agner.org/optimize/.

    Check it out on the Godbolt compiler explorer on the above link.

    gcc does fully optimize this function when multiplying by 16, though: you get a single right shift, more efficient than with unsigned __int128 multiply.
