Question
That is, what is the fastest way to do the test
if( a >= ( b - c > 0 ? b - c : 0 ) &&
a <= ( b + c < 255 ? b + c : 255 ) )
...
if a, b, and c are all unsigned char (aka BYTE)? I am trying to optimize an image scanning process to find a sub-image, and a comparison such as this is done about 3 million times per scan, so even minor optimizations could be helpful.
Not sure, but maybe some sort of bitwise operation? Maybe adding 1 to c and testing for less-than and greater-than without the or-equal-to part? I don't know!
Answer 1:
Well, first of all let's see what you are trying to check without all kinds of over/underflow checks:
a >= b - c
a <= b + c
Subtract b from both:
a - b >= -c
a - b <= c
Now that is equal to
abs(a - b) <= c
And in code:
(a>b ? a-b : b-a) <= c
Now, this code is a tad faster and doesn't contain (or need) complicated underflow/overflow checks.
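As a minimal sketch, the simplified test can be wrapped in a function and kept next to the asker's saturated version for comparison. The names within_tolerance and within_tolerance_saturated are made up for this illustration; only the two expressions come from the thread.

```cpp
// Hypothetical helper names; only the expressions come from the thread.

// The asker's original test, with the bounds clamped to 0..255.
bool within_tolerance_saturated(unsigned char a, unsigned char b, unsigned char c) {
    return a >= (b - c > 0 ? b - c : 0) &&
           a <= (b + c < 255 ? b + c : 255);
}

// The simplified test: |a - b| <= c.
// The operands are promoted to int before the subtraction, so neither
// a - b nor b - a can wrap around.
bool within_tolerance(unsigned char a, unsigned char b, unsigned char c) {
    return (a > b ? a - b : b - a) <= c;
}
```

Because a is itself limited to 0..255, clamping the bounds never changes the outcome, so both functions should agree for every possible input.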
I've profiled my code and 6502's code with 1000000000 repetitions and there was no measurable difference whatsoever. I would suggest picking the most elegant solution (which is IMO mine, but opinions differ), since performance does not separate the two.
However, there was a notable difference between my and the asker's code. This is the profiling code I used:
#include <iostream>
int main(int argc, char *argv[]) {
    // Initialized so that printing it below is well defined; the value is
    // printed only to stop the compiler from removing the loops.
    bool prevent_opti = false;
    for (int ai = 0; ai < 256; ++ai) {
        for (int bi = 0; bi < 256; ++bi) {
            for (int ci = 0; ci < 256; ++ci) {
                unsigned char a = ai;
                unsigned char b = bi;
                unsigned char c = ci;
                if ((a>b ? a-b : b-a) <= c) prevent_opti = true;
            }
        }
    }
    std::cout << prevent_opti << "\n";
    return 0;
}
With my if statement this took 120ms on average and the asker's if statement took 135ms on average.
Answer 2:
I think you will get the best performance by writing it in the clearest way possible and then turning on the compiler's optimizer. The compiler is rather good at this kind of optimization and will beat you most of the time (in the worst case it will equal you).
My preference would be:
int min = (b-c) > 0 ? (b-c) : 0 ;
int max = (b+c) < 255? (b+c) : 255;
if ((a >= min) && ( a<= max))
The original code (in assembly):
movl %eax, %ecx
movl %ebx, %eax
subl %ecx, %eax
movl $0, %edx
cmovs %edx, %eax
cmpl %eax, %r12d
jl L13
leal (%rcx,%rbx), %eax
cmpl $255, %eax
movb $-1, %dl
cmovg %edx, %eax
cmpl %eax, %r12d
jmp L13
My code (in assembly):
movl %eax, %ecx
movl %ebx, %eax
subl %ecx, %eax
movl $0, %edx
cmovs %edx, %eax
cmpl %eax, %r12d
jl L13
leal (%rcx,%rbx), %eax
cmpl $255, %eax
movb $-1, %dl
cmovg %edx, %eax
cmpl %eax, %r12d
jg L13
nightcracker's code (in assembly):
movl %r12d, %edx
subl %ebx, %edx
movl %ebx, %ecx
subl %r12d, %ecx
cmpl %ebx, %r12d
cmovle %ecx, %edx
cmpl %eax, %edx
jg L16
Answer 3:
Just using plain ints for a, b and c will allow you to change the code to the simpler
if (a >= b - c && a <= b + c) ...
Also, as an alternative, 256*256*256 is just 16M and a map of 16M bits is 2 MBytes. This means that it's feasible to use a lookup table like
int index = (a<<16) + (b<<8) + c;
if (lookup_table[index>>3] & (1<<(index&7))) ...
but I think that the cache thrashing will make this much slower, even if modern processors hate conditionals...
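For illustration, here is one way the 2 MByte bitmap could be filled and queried. This is only a sketch: build_lookup_table and lookup are hypothetical names, and the table is filled with the |a - b| <= c form of the test, which is assumed here to match the test from the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds the bitmap with one bit per (a, b, c) triple.
// Bit number (a<<16) + (b<<8) + c is set when a lies within c of b.
std::vector<std::uint8_t> build_lookup_table() {
    std::vector<std::uint8_t> table((256u * 256u * 256u) / 8u, 0);
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b)
            for (int c = 0; c < 256; ++c)
                if ((a > b ? a - b : b - a) <= c) {
                    std::size_t index = (static_cast<std::size_t>(a) << 16) + (b << 8) + c;
                    table[index >> 3] |= static_cast<std::uint8_t>(1u << (index & 7));
                }
    return table;
}

// Hypothetical helper: a single table read replaces the comparisons.
bool lookup(const std::vector<std::uint8_t>& table,
            unsigned char a, unsigned char b, unsigned char c) {
    std::size_t index = (static_cast<std::size_t>(a) << 16) + (b << 8) + c;
    return (table[index >> 3] & (1u << (index & 7))) != 0;
}
```

The table would be built once before the scan and then shared by every lookup. Whether it beats the arithmetic versions depends on how well the access pattern stays in cache, which is the concern raised above.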
Another alternative is to use a bit of algebra
b - c <= a <= b + c
iff
- c <= a - b <= c (subtracted b from all terms)
iff
0 <= a - b + c <= 2*c (added c to all terms)
This allows using just one test
if ((unsigned)(a - b + c) <= 2*c) ...
assuming that a, b and c are plain ints. The reason is that if a - b + c is negative then unsigned arithmetic will make it much bigger than 2*c (if c is 0..255).
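A small sketch of the wraparound argument, using `<=` to match the inequality 0 <= a - b + c <= 2*c. The function name in_range_single_compare is made up for this illustration.

```cpp
// Hypothetical helper name; the expression is the one derived above.
// a, b and c are plain ints holding values in 0..255.
bool in_range_single_compare(int a, int b, int c) {
    // If a - b + c is negative, converting it to unsigned wraps it to a
    // value close to UINT_MAX, far above 2*c (which is at most 510),
    // so the single comparison rejects it.
    return static_cast<unsigned>(a - b + c) <= static_cast<unsigned>(2 * c);
}
```

With b = 20 and c = 5 the accepted range is 15..25. For a = 10 the sum a - b + c is -5, which wraps to a huge unsigned value and fails the test. For a = 18 the sum is 3 and for a = 25 it is 10, both at most 2*c = 10, so they pass. For a = 26 the sum is 11 and the test fails.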
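The single-comparison test should generate efficient machine code with a single branch if the processor has dedicated signed/unsigned comparison instructions, as x86 does (ja/jg). The following program checks it against the original test for every combination of inputs: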
#include <stdio.h>
int main()
{
    int err = 0;
    for (int ia=0; ia<256; ia++)
        for (int ib=0; ib<256; ib++)
            for (int ic=0; ic<256; ic++)
            {
                unsigned char a = ia;
                unsigned char b = ib;
                unsigned char c = ic;
                int res1 = (a >= ( b - c > 0 ? b - c : 0 ) &&
                            a <= ( b + c < 255 ? b + c : 255 ));
                int res2 = (unsigned(a - b + c) <= 2*c);
                err += (res1 != res2);
            }
    printf("Errors = %i\n", err);
    return 0;
}
On x86 with g++, the assembler code generated for the res2 test includes only one conditional instruction.
The assembler code for the following loop is
void process(unsigned char *src, unsigned char *dst, int sz)
{
    for (int i=0; i<sz; i+=3)
    {
        unsigned char a = src[i];
        unsigned char b = src[i+1];
        unsigned char c = src[i+2];
        dst[i] = (unsigned(a - b + c) <= 2*c);
    }
}
.L3:
movzbl 2(%ecx,%eax), %ebx ; This loads c
movzbl (%ecx,%eax), %edx ; This loads a
movzbl 1(%ecx,%eax), %esi ; This loads b
leal (%ebx,%edx), %edx ; This computes a + c
addl %ebx, %ebx ; This is c * 2
subl %esi, %edx ; This is a - b + c
cmpl %ebx, %edx ; Comparison
setbe (%edi,%eax) ; Set 0/1 depending on result
addl $3, %eax ; next group
cmpl %eax, 16(%ebp) ; did we finish ?
jg .L3 ; if not loop back for next
Using dst[i] = (a<b ? b-a : a-b); instead, the code becomes much longer
.L9:
movzbl %dl, %edx
andl $255, %esi
subl %esi, %edx
.L4:
andl $255, %edi
cmpl %edi, %edx
movl 12(%ebp), %edx
setle (%edx,%eax)
addl $3, %eax
cmpl %eax, 16(%ebp)
jle .L6
.L5:
movzbl (%ecx,%eax), %edx
movb %dl, -13(%ebp)
movzbl 1(%ecx,%eax), %esi
movzbl 2(%ecx,%eax), %edi
movl %esi, %ebx
cmpb %bl, %dl
ja .L9
movl %esi, %ebx
movzbl %bl, %edx
movzbl -13(%ebp), %ebx
subl %ebx, %edx
jmp .L4
.p2align 4,,7
.p2align 3
.L6:
And I'm way too tired now to try to decipher it (2:28 AM here)
Anyway, longer doesn't necessarily mean slower (at first sight it seems g++ decided to unroll the loop, writing a few elements at a time in this case).
As I said before, you should do some actual profiling with your real computation and your real data. Note that if true performance is needed, the best strategy may differ depending on the processor.
For example, Linux runs a test during boot to decide which is the fastest way to perform a certain computation that is needed in the kernel. There are just too many variables (cache size/levels, RAM speed, CPU clock, chipset, CPU type...).
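The same "measure, then choose" idea can be sketched in user code. This is not how the kernel does it; it is a rough illustration with made-up names (test_abs_diff, test_single_cmp, time_candidate, pick_fastest) that times two of the candidates from this thread over all inputs and keeps the quicker one. Calling through a function pointer adds overhead of its own, so the numbers are only a coarse guide.

```cpp
#include <chrono>

// Candidate implementations taken from the answers in this thread.
bool test_abs_diff(int a, int b, int c) {
    return (a > b ? a - b : b - a) <= c;
}
bool test_single_cmp(int a, int b, int c) {
    return static_cast<unsigned>(a - b + c) <= static_cast<unsigned>(2 * c);
}

using RangeTest = bool (*)(int, int, int);

// Hypothetical result type: how many triples matched and how long it took.
struct Timing {
    long matches;
    double milliseconds;
};

// Runs one candidate over every (a, b, c) combination in 0..255.
Timing time_candidate(RangeTest test) {
    auto start = std::chrono::steady_clock::now();
    long matches = 0;
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b)
            for (int c = 0; c < 256; ++c)
                matches += test(a, b, c) ? 1 : 0;
    auto stop = std::chrono::steady_clock::now();
    return {matches, std::chrono::duration<double, std::milli>(stop - start).count()};
}

// Returns whichever candidate ran faster on this machine, this time.
RangeTest pick_fastest() {
    Timing t1 = time_candidate(test_abs_diff);
    Timing t2 = time_candidate(test_single_cmp);
    return t1.milliseconds <= t2.milliseconds ? test_abs_diff : test_single_cmp;
}
```

A program would call pick_fastest() once at startup and use the returned pointer in its scan loop. The match counts double as a sanity check that the candidates agree; with real image data the timings may look quite different from this synthetic sweep.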
Answer 4:
Rarely does embedding the ternary operator in another statement improve performance :)
If every single opcode matters, write the opcodes yourself: use assembler. Also consider using SIMD instructions if possible. I'd also be interested in the target platform. ARM assembler loves compares of this sort and has opcodes to speed up saturated math of this type.
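As one possible shape for the SIMD suggestion, here is a sketch that tests 16 pixels at a time with x86 SSE2 intrinsics and falls back to scalar code elsewhere. The target platform is not stated in the question, so x86 is purely an assumption, and within_tolerance_bulk is a made-up name. It relies on saturating unsigned subtraction, the kind of saturated math mentioned above.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: out[i] = 1 if |a[i] - b[i]| <= c[i], else 0.
void within_tolerance_bulk(const unsigned char* a, const unsigned char* b,
                           const unsigned char* c, unsigned char* out,
                           std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const __m128i one = _mm_set1_epi8(1);
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c + i));
        // Saturating subtraction clamps negative results to 0, so one of
        // the two differences is |a - b| and the other is 0.
        __m128i diff = _mm_or_si128(_mm_subs_epu8(va, vb), _mm_subs_epu8(vb, va));
        // diff <= c exactly when the saturating difference diff - c is 0.
        __m128i ok = _mm_cmpeq_epi8(_mm_subs_epu8(diff, vc), zero);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_and_si128(ok, one));
    }
#endif
    // Scalar tail (and the whole loop when SSE2 is not available).
    for (; i < n; ++i) {
        int d = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
        out[i] = d <= c[i] ? 1 : 0;
    }
}
```

This assumes the a, b and c values are stored in three separate byte arrays. Interleaved data, such as the src layout used in the process() loop of answer 3, would need a shuffle step first.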
Source: https://stackoverflow.com/questions/5924611/what-is-the-best-way-performance-wise-to-test-whether-a-value-falls-within-a-t