Getting the high part of 64 bit integer multiplication

無奈伤痛 2020-11-28 13:30

In C++, say that:

uint64_t i;
uint64_t j;

then i * j will yield a uint64_t that has as its value the lower part of the multiplication between i and j, i.e. (i * j) mod 2^64. How can I get the high part of the multiplication?

5 Answers
  •  無奈伤痛
    2020-11-28 14:15

    TL:DR with GCC for a 64-bit ISA: (a * (unsigned __int128)b) >> 64 compiles nicely, to a single full-multiply or high-half multiply instruction. No need to mess around with inline asm.


    Unfortunately current compilers don't optimize @craigster0's nice portable version, so if you want to take advantage of 64-bit CPUs, you can't use it except as a fallback for targets you don't have an #ifdef for. (I don't see a generic way to optimize it; you need a 128-bit type or an intrinsic.)


    GNU C (gcc, clang, or ICC) has unsigned __int128 on most 64-bit platforms. (Or in older versions, __uint128_t). GCC doesn't implement this type on 32-bit platforms, though.

    This is an easy and efficient way to get the compiler to emit a 64-bit full-multiply instruction and keep the high half. (GCC knows that a uint64_t cast to a 128-bit integer still has the upper half all zero, so you don't get a 128-bit multiply using three 64-bit multiplies.)

    MSVC also has a __umulh intrinsic for 64-bit high-half multiplication, but again it's only available on 64-bit platforms (specifically x86-64 and AArch64). The docs also mention _umul128 being available for IPF (IA-64), but I don't have MSVC for Itanium available to check. (Probably not relevant anyway.)

    #define HAVE_FAST_mul64 1
    
    #ifdef __SIZEOF_INT128__     // GNU C
     static inline
     uint64_t mulhi64(uint64_t a, uint64_t b) {
         unsigned __int128 prod =  a * (unsigned __int128)b;
         return prod >> 64;
     }
    
    #elif defined(_M_X64) || defined(_M_ARM64)     // MSVC
       // MSVC for x86-64 or AArch64
       // possibly also  || defined(_M_IA64) || defined(_WIN64)
       // but the docs only guarantee x86-64!  Don't use *just* _WIN64; it doesn't include AArch64 Android / Linux
    
      // https://docs.microsoft.com/en-gb/cpp/intrinsics/umulh
  #include <intrin.h>
      #define mulhi64 __umulh
    
    #elif defined(_M_IA64) // || defined(_M_ARM)       // MSVC again
      // https://docs.microsoft.com/en-gb/cpp/intrinsics/umul128
      // The docs there incorrectly say that _umul128 is available for ARM,
      // which would be weird because there's no single insn on AArch32
      #include <intrin.h>
      static inline
      uint64_t mulhi64(uint64_t a, uint64_t b) {
         unsigned __int64 HighProduct;
         (void)_umul128(a, b, &HighProduct);
         return HighProduct;
      }
    
    #else
    
    # undef HAVE_FAST_mul64
      uint64_t mulhi64(uint64_t a, uint64_t b);  // non-inline prototype
      // or you might want to define @craigster0's version here so it can inline.
    #endif
    

    For x86-64, AArch64, and PowerPC64 (and others), this compiles to one mul instruction, and a couple movs to deal with the calling convention (which should optimize away after this inlines). From the Godbolt compiler explorer (with source + asm for x86-64, PowerPC64, and AArch64):

         # x86-64 gcc7.3.  clang and ICC are the same.  (x86-64 System V calling convention)
         # MSVC makes basically the same function, but with different regs for x64 __fastcall
        mov     rax, rsi
        mul     rdi              # RDX:RAX = RAX * RDI
        mov     rax, rdx
        ret
    

    (or with clang -march=haswell to enable BMI2: mov rdx, rsi / mulx rax, rcx, rdi to put the high-half in RAX directly. gcc is dumb and still uses an extra mov.)

    For AArch64 (with gcc unsigned __int128 or MSVC with __umulh):

    test_var:
        umulh   x0, x0, x1
        ret
    

    With a compile-time constant power of 2 multiplier, we usually get the expected right-shift to grab a few high bits. But gcc amusingly uses shld (see the Godbolt link).


    Unfortunately current compilers don't optimize @craigster0's nice portable version. You get 8x shr r64,32, 4x imul r64,r64, and a bunch of add/mov instructions for x86-64. i.e. it compiles to a lot of 32x32 => 64-bit multiplies and unpacks of the results. So if you want something that takes advantage of 64-bit CPUs, you need some #ifdefs.

    A full-multiply mul 64 instruction is 2 uops on Intel CPUs, but still only 3 cycle latency, same as imul r64,r64 which only produces a 64-bit result. So the __int128 / intrinsic version is 5 to 10 times cheaper in latency and throughput (impact on surrounding code) on modern x86-64 than the portable version, from a quick eyeball guess based on http://agner.org/optimize/.

    Check it out on the Godbolt compiler explorer on the above link.

    gcc does fully optimize this function when multiplying by 16, though: you get a single right shift, more efficient than with unsigned __int128 multiply.
