Fastest way to set a Carry Flag

Question

I'm writing a loop to sum two arrays. My goal is to avoid the explicit carry check c = a + b; carry = (c < a). However, I lose CF when the cmp instruction performs the loop test.

Currently I'm using JE and STC to test and restore the previously saved state of CF. But the jump takes roughly 7 cycles, which is a lot for what I want.

   // This one works
   asm(
        "cmp $0,%0;"
        "je 0f;"
        "stc;"
    "0:"   
        "adcq %2, %1;"
        "setc %0"

    : "+r" (carry), "+r" (anum)
    : "r" (bnum)
   );

I already tried using SAHF (2 + 2 (mov) cycles), but that did not work.

   // Does not work
   asm(
        "mov %0, %%ah;"
        "sahf;"
        "adcq %2, %1;"
        "setc %0"

        : "+r" (carry), "+r" (anum)
        : "r" (bnum)
   );

Does anyone know a quicker way to set CF? Like a direct move or something similar.

Answer 1:

Looping without clobbering CF will be faster. See that link for some better asm loops.

Don't try to write just the adc with inline asm inside a C loop. It's impossible for that to be optimal, because you can't ask gcc not to clobber flags. Trying to learn asm with GNU C inline asm is harder than writing a stand-alone function, esp. in this case where you are trying to preserve the carry flag.

You could use setnc %[carry] to save and subb $1, %[carry] to restore. (Or cmpb $1, %[carry] I guess.) Or, as Stephen points out, save with setc %[carry] and restore with negb %[carry], which sets CF for any nonzero operand.

0 - 1 produces a carry, but 1 - 1 doesn't.
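To see why both pairings restore the right value, here is a small C model of the flag arithmetic. It is only a sketch: the helper names are made up for illustration, and the functions mimic how sub/cmp and neg set CF on an 8-bit operand rather than executing the real instructions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers that model x86 carry-flag behaviour in plain C. */

/* setc stores 1 when CF was set; setnc stores 1 when CF was clear. */
uint8_t model_setc(bool cf)  { return cf ? 1 : 0; }
uint8_t model_setnc(bool cf) { return cf ? 0 : 1; }

/* sub $1, x (and cmp $1, x): CF is the borrow, set when x < 1 unsigned. */
bool cf_after_sub1(uint8_t x) { return x < 1; }

/* neg x: CF is set for any nonzero operand. */
bool cf_after_neg(uint8_t x) { return x != 0; }
```

In this model, cf_after_sub1(model_setnc(cf)) and cf_after_neg(model_setc(cf)) both give back the original cf for either input.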

Use a uint8_t variable to hold the carry, since you will never add it directly to %[anum]. This avoids any chance of partial-register slowdowns, e.g.

uint8_t carry = 0;
int64_t anum, bnum;   // names must match the asm operands below

for (...) {
    asm ( "negb   %[carry]\n\t"
          "adc    %[bnum], %[anum]\n\t"
          "setc   %[carry]\n\t"
          : [carry] "+&r" (carry), [anum] "+r" (anum)
          : [bnum] "rme" (bnum)
          : // no clobbers
        );
}

You could also provide an alternate constraint pattern for register source, reg/mem dest. I used an x86 "e" constraint instead of "i", because 64-bit mode still only allows 32-bit sign-extended immediates. gcc will have to get larger compile-time constants into a register on its own. Carry is early-clobbered, so even if it and bnum were both 1 to start with, gcc couldn't use the same register for both inputs.
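As a rough illustration of the range the "e" constraint accepts, the check below tests whether a 64-bit constant survives being encoded as a 32-bit sign-extended immediate. fits_simm32 is a hypothetical helper written for this example, not something gcc exposes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if v can be encoded as a 32-bit immediate
   that the CPU sign-extends to 64 bits. */
bool fits_simm32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

So -1 and 0x7fffffff qualify, while 0x80000000 does not and would have to go through a register.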

This is still terrible, and increases the length of the loop-carried dependency chain from 2c to 4c (Intel pre-Broadwell), or from 1c to 3c (Intel BDW/Skylake, and AMD).

So your loop runs at 1/3rd speed because you're using a kludge instead of writing the whole loop in asm.

A previous version of this answer suggested adding the carry directly, instead of restoring it into CF. This approach has a fatal flaw: it mixed up the incoming carry into this iteration with the outgoing carry going to the next iteration.
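For comparison, here is a portable C sketch that keeps the two carries apart. It is a reference routine under my own naming (add_limbs is hypothetical), not the inline-asm version from this answer, and it will not be as fast as a real adc chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine: a[i] += b[i] with carry propagation.
   Returns the final carry out (0 or 1). */
uint8_t add_limbs(uint64_t *a, const uint64_t *b, size_t n)
{
    uint8_t carry = 0;                  /* incoming carry for limb i */
    for (size_t i = 0; i < n; i++) {
        uint64_t t = a[i] + b[i];       /* may wrap around */
        uint8_t c1 = t < a[i];          /* carry out of a[i] + b[i] */
        uint64_t s = t + carry;         /* now add the incoming carry */
        uint8_t c2 = s < t;             /* carry out of that second add */
        a[i] = s;
        carry = c1 | c2;                /* outgoing carry; c1 and c2 are never both 1 */
    }
    return carry;
}
```

The incoming carry is consumed before the outgoing one is computed, so a limb like b[i] = 0xffffffffffffffff with an incoming carry of 1 still produces the right carry out.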

Also, sahf is Store AH into Flags and lahf is Load AH from Flags. Both operate on the whole low 8 bits of FLAGS (SF, ZF, AF, PF and CF), not just CF. Pair those instructions with each other; don't use sahf on a 0 or 1 that you got from setc.
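For reference, the byte that these two instructions exchange with AH has CF in bit 0, PF in bit 2, AF in bit 4, ZF in bit 6 and SF in bit 7. The C model below is only a sketch with made-up names (model_lahf, model_sahf_cf, model_sahf_zf); it shows the bit layout, not the real instructions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in the low byte of FLAGS, as moved to/from AH. */
enum {
    FLAG_CF = 1u << 0,   /* carry */
    FLAG_PF = 1u << 2,   /* parity */
    FLAG_AF = 1u << 4,   /* auxiliary carry */
    FLAG_ZF = 1u << 6,   /* zero */
    FLAG_SF = 1u << 7    /* sign */
};

/* Hypothetical model: pack the five flags into an AH-style byte.
   Bit 1 always reads as 1; bits 3 and 5 read as 0. */
uint8_t model_lahf(bool cf, bool pf, bool af, bool zf, bool sf)
{
    uint8_t ah = 0x02;
    if (cf) ah |= FLAG_CF;
    if (pf) ah |= FLAG_PF;
    if (af) ah |= FLAG_AF;
    if (zf) ah |= FLAG_ZF;
    if (sf) ah |= FLAG_SF;
    return ah;
}

/* Hypothetical model: the CF and ZF values that an AH byte would write back. */
bool model_sahf_cf(uint8_t ah) { return (ah & FLAG_CF) != 0; }
bool model_sahf_zf(uint8_t ah) { return (ah & FLAG_ZF) != 0; }
```

Note that the byte carries five flags at once, so writing it back overwrites SF, ZF, AF and PF along with CF.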

Read the insn set reference manual for any insns that don't seem to be doing what you expect. See https://stackoverflow.com/tags/x86/info

Answer 2:

If the array size is known at compile time, you could do something like this:

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>   // malloc
#include <string.h>   // memset

#define str(s) #s
#define xstr(s) str(s)

#define ARRAYSIZE 4

asm(".macro AddArray2 p1, p2, from, to\n\t"
    "movq (\\from*8)(\\p2), %rax\n\t"
    "adcq %rax, (\\from*8)(\\p1)\n\t"
    ".if \\to-\\from\n\t"
    "   AddArray2 \\p1, \\p2, \"(\\from+1)\", \\to\n\t"
    ".endif\n\t"
    ".endm\n");

asm(".macro AddArray p1, p2, p3\n\t"
    "movq (\\p2), %rax\n\t"
    "addq %rax, (\\p1)\n\t"
    ".if \\p3-1\n\t"
    "   AddArray2 \\p1, \\p2, 1, (\\p3-1)\n\t"
    ".endif\n\t"
    ".endm");

int main()
{
   unsigned char carry;

   // assert(ARRAYSIZE > 0);

   // Create the arrays
   uint64_t *anum = (uint64_t *)malloc(ARRAYSIZE * sizeof(uint64_t));
   uint64_t *bnum = (uint64_t *)malloc(ARRAYSIZE * sizeof(uint64_t));

   // Put some data in
   memset(anum, 0xff, ARRAYSIZE * sizeof(uint64_t));
   memset(bnum, 0, ARRAYSIZE * sizeof(uint64_t));
   bnum[0] = 1;

   // Print the arrays before the add
   printf("anum: ");
   for (int x=0; x < ARRAYSIZE; x++)
   {
      printf("%" PRIx64 " ", anum[x]);
   }
   printf("\nbnum: ");
   for (int x=0; x < ARRAYSIZE; x++)
   {
      printf("%" PRIx64 " ", bnum[x]);
   }
   printf("\n");

   // Add the arrays
   asm ("AddArray %[anum], %[bnum], " xstr(ARRAYSIZE) "\n\t"
        "setc %[carry]" // Get the flags from the final add

       : [carry] "=q"(carry)
       : [anum] "r" (anum), [bnum] "r" (bnum)
       : "rax", "cc", "memory"
   );

   // Print the result
   printf("Result: ");
   for (int x=0; x < ARRAYSIZE; x++)
   {
      printf("%" PRIx64 " ", anum[x]);
   }
   printf(": %d\n", carry);
}

This gives code like this:

mov    (%rsi),%rax
add    %rax,(%rbx)
mov    0x8(%rsi),%rax
adc    %rax,0x8(%rbx)
mov    0x10(%rsi),%rax
adc    %rax,0x10(%rbx)
mov    0x18(%rsi),%rax
adc    %rax,0x18(%rbx)
setb   %bpl

Since adding 1 to all f's will completely overflow everything, the output from the code above is:

anum: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
bnum: 1 0 0 0
Result: 0 0 0 0 : 1

As written, ARRAYSIZE can be up to about 100 elements (due to the GNU assembler's limit on macro nesting depth). Seems like it should be enough...

Source: https://stackoverflow.com/questions/35298875/fastest-way-to-set-a-carry-flag

Tags: c++, assembly, inline-assembly, carryflag