Question
I'm writing a loop to sum two arrays. My goal is to do it without explicit carry checks such as c = a + b; carry = (c<a). The problem is that I lose CF when I do the loop test with the cmp instruction.
Currently I am using JE and STC to test the saved carry and restore the previously saved state of CF. But the jump takes roughly 7 cycles, which is a lot for what I want.
//This one is working
asm(
"cmp $0,%0;"
"je 0f;"
"stc;"
"0:"
"adcq %2, %1;"
"setc %0"
: "+r" (carry), "+r" (anum)
: "r" (bnum)
);
I already tried using SAHF (2 + 2 (mov) cycles), but that did not work.
//This one does not work
asm(
"mov %0, %%ah;"
"sahf;"
"adcq %2, %1;"
"setc %0"
: "+r" (carry), "+r" (anum)
: "r" (bnum)
);
Does anyone know a quicker way to set CF? Like a direct move or something similar.
Answer 1:
Looping without clobbering CF will be faster. See that link for some better asm loops.
Don't try to write just the adc with inline asm inside a C loop. It's impossible for that to be optimal, because you can't ask gcc not to clobber flags. Trying to learn asm with GNU C inline asm is harder than writing a stand-alone function, especially in this case where you are trying to preserve the carry flag.
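You could use setnc %[carry] to save and subb $1, %[carry] to restore. (Or cmpb $1, %[carry], I guess.) Or, as Stephen points out, save with setc and restore with negb %[carry], since 0 - x sets CF exactly when x is nonzero. In the setnc/subb version, 0 - 1 produces a carry, but 1 - 1 doesn't.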
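To make the arithmetic behind those restore tricks concrete, here is a minimal C model of the carry flag that each instruction leaves behind. The function names are hypothetical and exist only for this illustration; the model covers CF alone, not the other flags or the register result.

```c
#include <stdint.h>

/* CF after `subb $1, x` or `cmpb $1, x`: the unsigned subtraction
   x - 1 borrows only when x is 0. */
static inline int cf_after_sub1(uint8_t x) { return x < 1; }

/* CF after `negb x` (computes 0 - x): set unless x is 0. */
static inline int cf_after_neg(uint8_t x) { return x != 0; }

/* Save with setnc (stores the inverted carry), restore with sub/cmp. */
static inline int roundtrip_setnc_sub(int cf)
{
    uint8_t saved = (uint8_t)!cf;   /* what setnc would store */
    return cf_after_sub1(saved);
}

/* Save with setc (stores the carry itself), restore with neg. */
static inline int roundtrip_setc_neg(int cf)
{
    uint8_t saved = (uint8_t)(cf != 0);  /* what setc would store */
    return cf_after_neg(saved);
}
```

Both round trips return the carry they were given, which is why either pairing restores CF before the adc.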
Use a uint8_t variable to hold the carry, since you will never add it directly to %[anum]. This avoids any chance of partial-register slowdowns. e.g.
uint8_t carry = 0;
int64_t anum, bnum;   // names must match the asm operands below
for (...) {
asm ( "negb %[carry]\n\t"
"adc %[bnum], %[anum]\n\t"
"setc %[carry]\n\t"
: [carry] "+&r" (carry), [anum] "+r" (anum)
: [bnum] "rme" (bnum)
: // no clobbers
);
}
You could also provide an alternate constraint pattern for register source, reg/mem dest. I used an x86 "e" constraint instead of "i", because 64-bit mode still only allows 32-bit sign-extended immediates. gcc will have to get larger compile-time constants into a register on its own. carry is early-clobbered, so even if it and bnum were both 1 to start with, gcc couldn't use the same register for both inputs.
This is still terrible, and increases the length of the loop-carried dependency chain from 2c to 4c (Intel pre-Broadwell), or from 1c to 3c (Intel BDW/Skylake, and AMD).
So your loop runs at 1/3rd speed because you're using a kludge instead of writing the whole loop in asm.
A previous version of this answer suggested adding the carry directly, instead of restoring it into CF. This approach has a fatal flaw: it mixed up the incoming carry into this iteration with the outgoing carry going to the next iteration.
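As a reference for what the adc chain has to compute, here is a portable C sketch that keeps the incoming and outgoing carry separate. The names adc_u64 and add_limbs are hypothetical, and this is the slow carry-check style the question wants to avoid, so treat it as something to check an asm version against rather than a replacement for it.

```c
#include <stddef.h>
#include <stdint.h>

/* One adc step. carry_in (0 or 1) is the carry coming IN from the
   previous limb; the return value is the carry going OUT to the next. */
static inline uint8_t adc_u64(uint64_t a, uint64_t b, uint8_t carry_in,
                              uint64_t *sum)
{
    uint64_t t = a + b;            /* may wrap around */
    uint8_t c1 = (uint8_t)(t < a); /* carry out of a + b */
    uint64_t s = t + carry_in;     /* may wrap around */
    uint8_t c2 = (uint8_t)(s < t); /* carry out of adding carry_in */
    *sum = s;
    return (uint8_t)(c1 | c2);     /* the two cannot both be set */
}

/* a[0..n) += b[0..n), least-significant limb first.
   Returns the carry out of the most-significant limb. */
static uint8_t add_limbs(uint64_t *a, const uint64_t *b, size_t n)
{
    uint8_t carry = 0;
    for (size_t i = 0; i < n; i++)
        carry = adc_u64(a[i], b[i], carry, &a[i]);
    return carry;
}
```

The sum for limb i uses the carry produced by limb i-1, and the carry produced by limb i is only consumed by limb i+1. Folding the saved carry into the same register that also receives the new carry is what blurs the two.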
Also, sahf is Store AH into Flags, and lahf is Load AH from Flags (both operate on the whole low 8 bits of flags). Pair those instructions; don't use sahf on a 0 or 1 that you got from setc.
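For orientation, here is a small C sketch of the byte that lahf and sahf move through AH. The bit positions are the architectural ones for the low byte of FLAGS; the enum names and the helper ah_with_carry are hypothetical, added only for illustration.

```c
#include <stdint.h>

/* Status-flag bits in the low byte of FLAGS, as seen in AH. */
enum {
    AH_CF = 1u << 0,  /* carry */
    AH_PF = 1u << 2,  /* parity */
    AH_AF = 1u << 4,  /* auxiliary carry */
    AH_ZF = 1u << 6,  /* zero */
    AH_SF = 1u << 7   /* sign */
};

/* Hypothetical helper: take a byte saved by lahf and change only CF,
   leaving the other status flags as they were. */
static inline uint8_t ah_with_carry(uint8_t saved_ah, int carry)
{
    return (uint8_t)((saved_ah & ~AH_CF) | (carry ? AH_CF : 0));
}
```

A byte that is just 0 or 1 does carry the right CF bit, but feeding it to sahf also writes SF, ZF, AF and PF as zero, which is one reason to keep lahf and sahf paired.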
Read the insn set reference manual for any insns that don't seem to be doing what you expect. See https://stackoverflow.com/tags/x86/info
Answer 2:
If the array size is known at compile time, you could do something like this:
#include <inttypes.h>
#include <stdlib.h>   // malloc
#include <stdio.h>
#include <string.h>   // memset
#define str(s) #s
#define xstr(s) str(s)
#define ARRAYSIZE 4
asm(".macro AddArray2 p1, p2, from, to\n\t"
"movq (\\from*8)(\\p2), %rax\n\t"
"adcq %rax, (\\from*8)(\\p1)\n\t"
".if \\to-\\from\n\t"
" AddArray2 \\p1, \\p2, \"(\\from+1)\", \\to\n\t"
".endif\n\t"
".endm\n");
asm(".macro AddArray p1, p2, p3\n\t"
"movq (\\p2), %rax\n\t"
"addq %rax, (\\p1)\n\t"
".if \\p3-1\n\t"
" AddArray2 \\p1, \\p2, 1, (\\p3-1)\n\t"
".endif\n\t"
".endm");
int main()
{
unsigned char carry;
// assert(ARRAYSIZE > 0);
// Create the arrays
uint64_t *anum = (uint64_t *)malloc(ARRAYSIZE * sizeof(uint64_t));
uint64_t *bnum = (uint64_t *)malloc(ARRAYSIZE * sizeof(uint64_t));
// Put some data in
memset(anum, 0xff, ARRAYSIZE * sizeof(uint64_t));
memset(bnum, 0, ARRAYSIZE * sizeof(uint64_t));
bnum[0] = 1;
// Print the arrays before the add
printf("anum: ");
for (int x=0; x < ARRAYSIZE; x++)
{
printf("%" PRIx64 " ", anum[x]);
}
printf("\nbnum: ");
for (int x=0; x < ARRAYSIZE; x++)
{
printf("%" PRIx64 " ", bnum[x]);
}
printf("\n");
// Add the arrays
asm ("AddArray %[anum], %[bnum], " xstr(ARRAYSIZE) "\n\t"
"setc %[carry]" // Get the flags from the final add
: [carry] "=q"(carry)
: [anum] "r" (anum), [bnum] "r" (bnum)
: "rax", "cc", "memory"
);
// Print the result
printf("Result: ");
for (int x=0; x < ARRAYSIZE; x++)
{
printf("%" PRIx64 " ", anum[x]);
}
printf(": %d\n", carry);
}
This gives code like this:
mov (%rsi),%rax
add %rax,(%rbx)
mov 0x8(%rsi),%rax
adc %rax,0x8(%rbx)
mov 0x10(%rsi),%rax
adc %rax,0x10(%rbx)
mov 0x18(%rsi),%rax
adc %rax,0x18(%rbx)
setb %bpl
Since adding 1 to all f's will completely overflow everything, the output from the code above is:
anum: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
bnum: 1 0 0 0
Result: 0 0 0 0 : 1
As written, ARRAYSIZE can be up to about 100 elements (due to the GNU assembler's macro nesting depth limit). Seems like it should be enough...
Source: https://stackoverflow.com/questions/35298875/fastest-way-to-set-a-carry-flag