Signed saturated add of 64-bit ints?


Question


I'm looking for some C code for signed saturated 64-bit addition that compiles to efficient x86-64 code with the gcc optimizer. Portable code would be ideal, although an asm solution could be used if necessary.

static const int64 kint64max = 0x7fffffffffffffffll;
static const int64 kint64min = 0x8000000000000000ll;

int64 signed_saturated_add(int64 x, int64 y) {
  bool x_is_negative = (x & kint64min) != 0;
  bool y_is_negative = (y & kint64min) != 0;
  int64 sum = x+y;
  bool sum_is_negative = (sum & kint64min) != 0;
  if (x_is_negative != y_is_negative) return sum;  // can't overflow
  if (x_is_negative && !sum_is_negative) return kint64min;
  if (!x_is_negative && sum_is_negative) return kint64max;
  return sum;
}

The function as written produces a fairly lengthy assembly output with several branches. Any tips on optimization? Seems like it ought to be implementable with just an ADD and a few CMOV instructions, but I'm a little bit rusty with this stuff.


Answer 1:


This may be optimized further, but here is a portable solution. It does not invoke undefined behavior, and it checks for integer overflow before it could occur.

#include <stdint.h>

int64_t sadd64(int64_t a, int64_t b)
{
    if (a > 0) {
        if (b > INT64_MAX - a) {
            return INT64_MAX;
        }
    } else if (b < INT64_MIN - a) {
        return INT64_MIN;
    }

    return a + b;
}
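
A quick sanity check of the boundary behaviour (example values for illustration, appended to the snippet above):

#include <assert.h>
#include <stdio.h>

int main(void) {
    assert(sadd64(INT64_MAX, 1) == INT64_MAX);   // clamps at the top
    assert(sadd64(INT64_MIN, -1) == INT64_MIN);  // clamps at the bottom
    assert(sadd64(-5, 3) == -2);                 // ordinary case is unchanged
    puts("ok");
    return 0;
}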



Answer 2:


This is a solution that continues in the vein suggested in one of the comments, and that ouah's solution uses as well. Here the generated code should be free of conditional jumps:

int64_t signed_saturated_add(int64_t x, int64_t y) {
  // determine the lower or upper bound of the result
  int64_t ret =  (x < 0) ? INT64_MIN : INT64_MAX;
  // this is always well defined:
  // if x < 0 this adds a positive value to INT64_MIN
  // if x > 0 this subtracts a positive value from INT64_MAX
  int64_t comp = ret - x;
  // the condition is equivalent to
  // ((x < 0) && (y > comp)) || ((x >=0) && (y <= comp))
  if ((x < 0) == (y > comp)) ret = x + y;
  return ret;
}

The first line looks as though it would need a conditional move, but because of the special values my compiler gets away with an addition: in 2's complement, INT64_MIN is INT64_MAX+1. There is then only one conditional move, for the assignment of the sum in the case where everything is fine.

All of this has no UB, because in the abstract state machine the sum is only done if there is no overflow.




Answer 3:


Related: unsigned saturation is much easier, and efficiently possible in pure ISO C: How to do unsigned saturating addition in C?


Compilers are terrible at all of the pure C options proposed so far.

They don't see that they can use the signed-overflow flag result from an add instruction to detect that saturation to INT64_MIN/MAX is necessary. AFAIK there's no pure C pattern that compilers recognize as reading the OF flag result of an add.
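
For example, a typical pure-C pattern looks something like this (a sketch for illustration, not code from the question): wrap with unsigned math and compare sign bits. Compilers generally emit extra xor/and/test work for this instead of just reading OF after the add.

#include <stdint.h>
#include <stdbool.h>

static inline bool add_overflows(int64_t a, int64_t b, int64_t *sum) {
    uint64_t wrapped = (uint64_t)a + (uint64_t)b;   // unsigned add wraps, no UB
    *sum = (int64_t)wrapped;                        // GNU C: bit-pattern preserved
    // overflow iff a and b have the same sign but the sum's sign differs
    return ((a ^ *sum) & (b ^ *sum)) < 0;
}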

Inline asm is not a bad idea here, but we can avoid that with GCC's builtins that expose UB-safe wrapping signed addition with a boolean overflow result. https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html

(If you were going to use GNU C inline asm, that would limit you just as much as these GNU C builtins. And these builtins aren't arch-specific. They do require gcc5 or newer, but gcc4.9 and older are basically obsolete. https://gcc.gnu.org/wiki/DontUseInlineAsm - it defeats constant propagation and is hard to maintain.)


This version uses the fact that INT64_MIN = INT64_MAX + 1ULL (for 2's complement) to select INT64_MIN/MAX based on the sign of b. Signed-overflow UB is avoided by using uint64_t for that add, and GNU C defines the behaviour of converting an unsigned integer to a signed type that can't represent its value (bit-pattern used unchanged). Current gcc/clang benefit from this hand-holding because they don't figure out this trick from a ternary like (b<0) ? INT64_MIN : INT64_MAX. (See below for the alternate version using that). I haven't checked the asm on 32-bit architectures.

GCC only supports 2's complement integer types, so a function using __builtin_add_overflow doesn't have to care about portability to C implementations that use 1's complement (where the same identity holds) or sign/magnitude (where it doesn't), even if you made a version for long or int instead of int64_t.

#include <stdint.h>
#ifndef __cplusplus
#include <stdbool.h>
#endif

// static inline
int64_t signed_sat_add64_gnuc_v2(int64_t a, int64_t b) {
    long long res;
    bool overflow = __builtin_saddll_overflow(a, b, &res);
    if (overflow) {
            // overflow is only possible in one direction depending on the sign bit
            return ((uint64_t)b >> 63) + INT64_MAX;
            // INT64_MIN = INT64_MAX + 1  wraparound, done with unsigned
    }

    return res;
}

Another option is (b>>63) ^ INT64_MAX, which might be useful if manually vectorizing where SIMD XOR can run on more ports than SIMD ADD, like on Intel before Skylake. (But x86 doesn't have a SIMD 64-bit arithmetic right shift, only logical, so this would only help for an int32_t version, and you'd need to efficiently detect overflow in the first place. Or you might use a variable blend on the sign bit, like blendvpd.) See Add saturate 32-bit signed ints intrinsics? with x86 SIMD intrinsics (SSE2/SSE4).
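
As a rough sketch of that int32_t SIMD idea (my illustration, not code from the linked Q&A; SSE4.1 assumed for the variable blend):

#include <immintrin.h>
#include <stdint.h>

// 4x saturating signed 32-bit add, sketch only.
// sat = (b >> 31, arithmetic) ^ INT32_MAX  ->  INT32_MIN if b < 0, else INT32_MAX
static __m128i sat_add_epi32_sketch(__m128i a, __m128i b) {
    __m128i sum = _mm_add_epi32(a, b);
    __m128i sat = _mm_xor_si128(_mm_srai_epi32(b, 31), _mm_set1_epi32(INT32_MAX));
    // overflow iff a and b have the same sign but the sum's sign differs
    __m128i overflow = _mm_andnot_si128(_mm_xor_si128(a, b),
                                        _mm_xor_si128(a, sum));
    // blendv keys off the sign bit of 'overflow': take 'sat' where it overflowed
    return _mm_castps_si128(_mm_blendv_ps(_mm_castsi128_ps(sum),
                                          _mm_castsi128_ps(sat),
                                          _mm_castsi128_ps(overflow)));
}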

On Godbolt with gcc9 and clang8 (along with the other implementations from other answers):

# gcc9.1 -O3   (clang chooses branchless with cmov)
signed_sat_add64_gnuc_v2:

        add     rdi, rsi                   # the actual add
        jo      .L3                        # jump on signed overflow
        mov     rax, rdi                   # retval = the non-overflowing add result
        ret
.L3:
        movabs  rax, 9223372036854775807   # INT64_MAX
        shr     rsi, 63                    # b is still available after the ADD
        add     rax, rsi
        ret

When inlining into a loop, the mov imm64 can be hoisted. If b is needed afterwards then we might need an extra mov, otherwise shr/add can destroy b, leaving the INT64_MAX constant in a register undamaged. Or if the compiler wants to use cmov (like clang does), it has to mov/shr because it has to get the saturation constant ready before the add, preserving both operands.

Notice that the critical path for the non-overflowing case only includes an add and a not-taken jo. These can't macro-fuse into a single uop even on Sandybridge-family, but the jo only costs throughput not latency thanks to branch prediction + speculative execution. (When inlining, the mov will go away.)

If saturation is actually not rare and branch prediction is a problem, compile with profile-guided optimization and gcc will hopefully do if-conversion and use a cmovno instead of jo, like clang does. That puts the CMOV itself on the critical path; the MIN/MAX selection can still run in parallel with the add.

You could use a<0 instead. I used b because I think most people would write x = sadd(x, 123) instead of x = sadd(123, x), and having a compile-time-constant b allows the b<0 to optimize away. For maximal optimization opportunity, you could use if (__builtin_constant_p(a)) to ask the compiler whether a is a compile-time constant. That works for gcc, but clang evaluates the const-ness way too early, before inlining, so it's useless except in macros with clang. (Related: ICC19 doesn't do constant propagation through __builtin_saddll_overflow: it puts both inputs in registers and still does the add. GCC and Clang just return a constant.)
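
A minimal sketch of that idea (a hypothetical wrapper, gcc-oriented, not one of the versions above): on overflow both addends have the same sign, so either one can select the saturation limit, and a compile-time-constant one lets the shift fold away.

// same #includes as signed_sat_add64_gnuc_v2 above
int64_t signed_sat_add64_constprop(int64_t a, int64_t b) {
    long long res;
    if (__builtin_saddll_overflow(a, b, &res)) {
        // either addend's sign works here; prefer a compile-time constant
        uint64_t sign_src = __builtin_constant_p(a) ? (uint64_t)a : (uint64_t)b;
        // out-of-range unsigned->signed conversion is defined in GNU C (bit-pattern kept)
        return (sign_src >> 63) + INT64_MAX;
    }
    return res;
}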

This optimization is especially valuable inside a loop with the MIN/MAX selection hoisted, leaving only add + cmovo. (Or add + jo to a mov.)

cmov is a 2 uop instruction on Intel P6-family and SnB-family before Broadwell because it has 3 inputs. On other x86 CPUs (Broadwell / Skylake, and AMD), it's a single-uop instruction. On most such CPUs it has 1 cycle latency. It's a simple ALU select operation; the only complication is 3 inputs (2 regs + FLAGS). But on KNL it's still 2-cycle latency.


Unfortunately gcc for AArch64 fails to use adds to set flags and check the V (overflow) flag result, so it spends several instructions deciding whether to branch.

Clang does a great job, and AArch64's immediate encodings can represent INT64_MAX as an operand to eor or add.

// clang8.0 -O3 -target aarch64
signed_sat_add64_gnuc:
    orr     x9, xzr, #0x7fffffffffffffff      // mov constant = OR with zero reg
    adds    x8, x0, x1                        // add and set flags
    add     x9, x9, x1, lsr #63               // sat = (b shr 63) + MAX
    csel    x0, x9, x8, vs                    // conditional-select, condition = VS = oVerflow flag Set
    ret

Optimizing MIN vs. MAX selection

As noted above, return (b<0) ? INT64_MIN : INT64_MAX; doesn't compile optimally with most versions of gcc/clang; they generate both constants in registers and a cmov to select, or something similar on other ISAs.

We can assume 2's complement because GCC only supports 2's complement integer types, and because the ISO C optional int64_t type is guaranteed to be 2's complement if it exists. (Signed overflow of int64_t is still UB; this allows it to be a simple typedef of long or long long.)

(On a sign/magnitude C implementation that supported some equivalent of __builtin_add_overflow, a version of this function for long long or int couldn't use the SHR / ADD trick. For extreme portability you'd probably just use the simple ternary, or for sign/magnitude specifically you could return (b&0x800...) | 0x7FFF... to OR the sign bit of b into a max-magnitude number.)

For two's complement, the bit-patterns for MIN and MAX are 0x8000... (just the high bit set) and 0x7FFF... (all other bits set). They have a couple interesting properties: MIN = MAX + 1 (if computed with unsigned on the bit-pattern), and MIN = ~MAX: their bit-patterns are bitwise inverses, aka one's complement of each other.
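
Both properties can be spot-checked at compile time; a tiny C11 illustration (not from the original answer):

#include <stdint.h>

_Static_assert((uint64_t)INT64_MIN == (uint64_t)INT64_MAX + 1,
               "MIN = MAX + 1 on the unsigned bit-pattern");
_Static_assert(INT64_MIN == ~INT64_MAX,
               "MIN = ~MAX: the bit-patterns are bitwise inverses");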

The MIN = ~MAX property follows from ~x = -x - 1 (a re-arrangement of the standard -x = ~x + 1 2's complement identity) and the fact that MIN = -MAX - 1. The +1 property is unrelated: it follows from simple rollover from most-positive to most-negative, and it applies to the one's complement encoding of signed integers as well. (But not to sign/magnitude, where the bit-pattern MAX+1, i.e. just the sign bit set, is negative zero.)

The above function uses the MIN = MAX + 1 trick. The MIN = ~MAX trick is also usable by broadcasting the sign bit to all positions with an arithmetic right shift (creating 0 or 0xFF...), and XORing with that. GNU C guarantees that signed right shifts are arithmetic (sign-extension), so (b>>63) ^ INT64_MAX is equivalent to (b<0) ? INT64_MIN : INT64_MAX in GNU C.

ISO C leaves signed right shifts implementation-defined, but we could use a ternary of b<0 ? ~0ULL : 0ULL. Compilers will optimize the following to sar / xor, or equivalent instruction(s), but it has no implementation-defined behaviour. AArch64 can use a shifted input operand for eor just as well as it can for add.

        // an earlier version of this answer used this
        int64_t mask = (b<0) ? ~0ULL : 0;  // compiles to sar with good compilers, but is not implementation-defined.
        return mask ^ INT64_MAX;
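
Putting those pieces together, a full XOR-based variant could look like this (my own combination for illustration, not one of the versions on the Godbolt link):

// same structure as signed_sat_add64_gnuc_v2, but using MIN = ~MAX instead of MIN = MAX + 1
int64_t signed_sat_add64_xor(int64_t a, int64_t b) {
    long long res;
    if (__builtin_saddll_overflow(a, b, &res)) {
        int64_t mask = (b < 0) ? -1 : 0;   // all-ones if b is negative; compiles to sar
        return mask ^ INT64_MAX;           // INT64_MIN if b < 0, else INT64_MAX
    }
    return res;
}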

Fun fact: AArch64 has a csinv instruction: conditional-select inverse. And it can put INT64_MIN into a register with a single 32-bit mov instruction, thanks to its powerful immediate encodings for simple bit-patterns. AArch64 GCC was already using csinv for the MIN = ~MAX trick for the original return (b<0) ? INT64_MIN : INT64_MAX; version.

clang 6.0 and earlier on Godbolt were using shr/add for the plain (b<0) ? INT64_MIN : INT64_MAX; version. It looks more efficient than what clang7/8 do, so that's a regression / missed-optimization bug I think. (And it's the whole point of this section and why I wrote a 2nd version.)

I chose the MIN = MAX + 1 version because it could possibly auto-vectorize better: x86 has 64-bit SIMD logical right shifts but only 16 and 32-bit SIMD arithmetic right shifts until AVX512F. Of course, signed-overflow detection with SIMD probably makes it not worth it until AVX512 for 64-bit integers. Well, maybe AVX2. And if it's part of some larger calculation that can otherwise vectorize efficiently, then unpacking to scalar and back sucks.

For scalar it's truly a wash; neither way compiles any better, and sar/shr perform identically, and so do add/xor, on all CPUs that Agner Fog has tested. (https://agner.org/optimize/).

But add can sometimes optimize into other things, so you could imagine gcc folding a later + or - of a constant into the overflow branch, or possibly using LEA for that add instead of ADD to copy-and-add. The difference in power between a simpler ALU execution unit for XOR vs. ADD is going to be lost in the noise from the cost of all the power it takes to do out-of-order execution and other stuff; all x86 CPUs have single-cycle scalar ADD and XOR, even for 64-bit integers, even on P4 Prescott/Nocona with its exotic adders.

Also @chqrlie suggested a compact readable way to write it in C without UB that looks nicer than the super-portable int mask = ternary thing.

The earlier "simpler" version of this function

It doesn't depend on any special property of MIN/MAX, so it may be useful for saturating to other boundaries with other overflow-detection conditions, or in case a compiler does something better with this version.

int64_t signed_sat_add64_gnuc(int64_t a, int64_t b) {
    long long res;
    bool overflow = __builtin_saddll_overflow(a, b, &res);
    if (overflow) {
            // overflow is only possible in one direction for a given `b`
            return (b<0) ? INT64_MIN : INT64_MAX;
    }
    return res;
}

which compiles as follows:

# gcc9.1 -O3   (clang chooses branchless)
signed_sat_add64_gnuc:
        add     rdi, rsi                   # the actual add
        jo      .L3                        # jump on signed overflow
        mov     rax, rdi                   # retval = the non-overflowing add result
        ret
.L3:
        movabs  rdx, 9223372036854775807
        test    rsi, rsi                      # one of the addends is still available after
        movabs  rax, -9223372036854775808     # missed optimization: lea rdx, [rax+1]
        cmovns  rax, rdx                      # branchless selection of which saturation limit
        ret

This is basically what @drwowe's inline asm does, but with a test replacing one cmov. (And of course different conditions on the cmov.)

Another downside to this vs. the _v2 with shr/add is that this needs 2 constants. In a loop, this would tie up an extra register. (Again unless b is a compile-time constant.)

clang uses cmov instead of a branch, and does spot the lea rax, [rcx + 1] trick to avoid a 2nd 10-byte mov r64, imm64 instruction. (Or clang6.0 and earlier use the shr 63/add trick instead of that cmov.)


The first version of this answer put int64_t sat = (b<0) ? MIN : MAX outside the if(), but gcc missed the optimization of moving that inside the branch so it's not run at all for the non-overflow case. That's even better than running it off the critical path. (And doesn't matter if the compiler decides to go branchless).

But when I put it outside the if and after the __builtin_saddll_overflow, gcc was really dumb and saved the bool result in an integer, then did the test/cmov, then used test on the saddll_overflow result again to put it back in FLAGS. Reordering the source fixed that.




Answer 4:


I'm still looking for a decent portable solution, but this is as good as I've come up with so far:

Suggestions for improvements?

int64 saturated_add(int64 x, int64 y) {
#if defined(__GNUC__) && defined(__x86_64__)
  asm("add %1, %0\n\t"      // x += y, setting OF on signed overflow
      "jno 1f\n\t"          // no overflow: keep the sum
      "cmovge %3, %0\n\t"   // wrapped to negative => positive overflow: clamp to kint64max
      "cmovl %2, %0\n"      // wrapped to non-negative => negative overflow: clamp to kint64min
      "1:" : "+r"(x) : "r"(y), "r"(kint64min), "r"(kint64max));
  return x;
#else
  return portable_saturated_add(x, y);
#endif
}


Source: https://stackoverflow.com/questions/17580118/signed-saturated-add-of-64-bit-ints
