What is the best way (performance-wise) to test whether a value falls within a threshold?

Question

That is, what is the fastest way to do the test

if( a >= ( b - c > 0 ? b - c : 0 ) &&
    a <= ( b + c < 255 ? b + c : 255 ) )

    ...

if a, b, and c are all unsigned char aka BYTE. I am trying to optimize an image scanning process to find a sub-image, and a comparison such as this is done about 3 million times per scan, so even minor optimizations could be helpful.

Not sure, but maybe some sort of bitwise operation? Maybe adding 1 to c and testing for less-than and greater-than without the or-equal-to part? I don't know!

Answer 1:

Well, first of all let's see what you are trying to check without all kinds of over/underflow checks:

a >= b - c
a <= b + c
subtract b from both:
a - b >= -c
a - b <= c

Now that is equal to

abs(a - b) <= c

And in code:

(a>b ? a-b : b-a) <= c

Now, this code is a tad faster and doesn't contain (or need) complicated underflow/overflow checks.
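
For anyone who wants to convince themselves that the rewrite is equivalent, here is a minimal sketch that compares both forms over every possible byte triple. The function names (in_range_clamped, in_range_absdiff, count_mismatches, self_check) are made up for this illustration and appear nowhere in the question.

```cpp
#include <cassert>

// The clamped test exactly as written in the question.
// b - c and b + c are evaluated as int because of integer promotion.
bool in_range_clamped(unsigned char a, unsigned char b, unsigned char c) {
    return a >= (b - c > 0 ? b - c : 0) &&
           a <= (b + c < 255 ? b + c : 255);
}

// The absolute-difference form suggested in this answer.
bool in_range_absdiff(unsigned char a, unsigned char b, unsigned char c) {
    return (a > b ? a - b : b - a) <= c;
}

// Counts the (a, b, c) triples on which the two forms disagree.
int count_mismatches() {
    int mismatches = 0;
    for (int ai = 0; ai < 256; ++ai)
        for (int bi = 0; bi < 256; ++bi)
            for (int ci = 0; ci < 256; ++ci) {
                unsigned char a = static_cast<unsigned char>(ai);
                unsigned char b = static_cast<unsigned char>(bi);
                unsigned char c = static_cast<unsigned char>(ci);
                if (in_range_clamped(a, b, c) != in_range_absdiff(a, b, c))
                    ++mismatches;
            }
    return mismatches;
}

// Hypothetical helper: aborts if the two forms ever disagree.
void self_check() {
    assert(count_mismatches() == 0);
}
```

The clamping in the original is redundant because a is itself a byte: it can never be below 0 or above 255, so clamping the bounds to that range changes nothing.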

I've profiled my code and 6502's code with 1000000000 repetitions, and there was no measurable difference whatsoever. I would suggest picking the most elegant solution (which is IMO mine, but opinions differ), since performance is not an argument.

However, there was a notable difference between my and the asker's code. This is the profiling code I used:

#include <iostream>

int main(int argc, char *argv[]) {
    bool prevent_opti = false;  // must be initialized; reading it otherwise is undefined behavior
    for (int ai = 0; ai < 256; ++ai) {
        for (int bi = 0; bi < 256; ++bi) {
            for (int ci = 0; ci < 256; ++ci) {
                unsigned char a = ai;
                unsigned char b = bi;
                unsigned char c = ci;
                if ((a>b ? a-b : b-a) <= c) prevent_opti = true;
            }
        }
    }

    std::cout << prevent_opti << "\n";

    return 0;
}

With my if statement this took 120ms on average and the asker's if statement took 135ms on average.

Answer 2:

I think you will get the best performance by writing it in the clearest way possible and then turning on the compiler's optimizer. The compiler is rather good at this kind of optimization and will beat you most of the time (in the worst case it will equal you).

My preference would be:

int min = (b - c) > 0   ? (b - c) : 0;
int max = (b + c) < 255 ? (b + c) : 255;

if ((a >= min) && (a <= max))

The original code (in assembly):

movl    %eax, %ecx
movl    %ebx, %eax
subl    %ecx, %eax
movl    $0, %edx
cmovs   %edx, %eax
cmpl    %eax, %r12d
jl  L13
leal    (%rcx,%rbx), %eax
cmpl    $255, %eax
movb    $-1, %dl
cmovg   %edx, %eax
cmpl    %eax, %r12d
jmp L13

My code (in assembly):

movl    %eax, %ecx
movl    %ebx, %eax
subl    %ecx, %eax
movl    $0, %edx
cmovs   %edx, %eax
cmpl    %eax, %r12d
jl  L13
leal    (%rcx,%rbx), %eax
cmpl    $255, %eax
movb    $-1, %dl
cmovg   %edx, %eax
cmpl    %eax, %r12d
jg  L13

nightcracker's code (in assembly):

movl    %r12d, %edx
subl    %ebx, %edx
movl    %ebx, %ecx
subl    %r12d, %ecx
cmpl    %ebx, %r12d
cmovle  %ecx, %edx
cmpl    %eax, %edx
jg  L16

Answer 3:

Just using plain ints for a, b and c will allow you to change the code to the simpler

if (a >= b - c && a <= b + c) ...

Also, as an alternative, 256*256*256 is just 16M and a map of 16M bits is 2 MBytes. This means that it's feasible to use a lookup table like

int index = (a<<16) + (b<<8) + c;
if (lookup_table[index>>3] & (1<<(index&7))) ...

but I think that the cache thrashing will make this much slower, even if modern processors hate conditionals...
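
To make the lookup-table idea concrete, here is a rough sketch of how the 2 MB bitmap could be built and queried. The names build_lookup_table and lookup are invented for this example, and the table is filled using the plain-int form of the test shown above; treat it as an illustration, not a tuned implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds the bitmap: bit number (a<<16) + (b<<8) + c is set
// when b - c <= a <= b + c. 2^24 bits = 2^21 bytes = 2 MB.
std::vector<std::uint8_t> build_lookup_table() {
    std::vector<std::uint8_t> table((1u << 24) / 8, 0);
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b)
            for (int c = 0; c < 256; ++c) {
                if (a >= b - c && a <= b + c) {
                    std::uint32_t index = (std::uint32_t(a) << 16) +
                                          (std::uint32_t(b) << 8) +
                                          std::uint32_t(c);
                    table[index >> 3] |= std::uint8_t(1u << (index & 7));
                }
            }
    return table;
}

// Tests one (a, b, c) triple with a single load, shift and mask.
bool lookup(const std::vector<std::uint8_t>& table,
            unsigned char a, unsigned char b, unsigned char c) {
    assert(table.size() == (1u << 21));  // expects the full 2 MB table
    std::uint32_t index = (std::uint32_t(a) << 16) +
                          (std::uint32_t(b) << 8) +
                          std::uint32_t(c);
    return (table[index >> 3] & (1u << (index & 7))) != 0;
}
```

The table would be built once up front and then passed to lookup for every comparison; whether that beats the arithmetic test depends on how much of the 2 MB stays in cache for the actual access pattern.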

Another alternative is to use a bit of algebra

b - c <= a <= b + c
      iff
- c <= a - b <= c        (subtracted b from all terms)
      iff
0 <= a - b + c <= 2*c    (added c to all terms)

this allows using just one test

if ((unsigned)(a - b + c) <= 2*c) ...

assuming that a, b and c are plain ints. The reason is that if a - b + c is negative then unsigned arithmetic will make it much bigger than 2*c (if c is 0..255). This should generate efficient machine code with a single branch if the processor has dedicated signed/unsigned comparison instructions like x86 (ja/jg).
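
A few concrete values may help to see the wraparound at work. The sketch below wraps the single-comparison test in a function (in_range_single_compare and self_check are names invented here) and uses <= so that it matches the inequality 0 <= a - b + c <= 2*c derived above; it assumes a, b and c are all in 0..255.

```cpp
#include <cassert>

// Single-comparison range test; a, b and c are assumed to be in 0..255.
bool in_range_single_compare(int a, int b, int c) {
    // When a - b + c is negative, converting it to unsigned wraps it to a
    // value close to UINT_MAX, which is far above 2*c (at most 510).
    return static_cast<unsigned>(a - b + c) <= static_cast<unsigned>(2 * c);
}

// Hypothetical worked examples.
void self_check() {
    assert(in_range_single_compare(10, 20, 10));   // 10 - 20 + 10 = 0,  0 <= 20
    assert(!in_range_single_compare(10, 20, 5));   // 10 - 20 + 5 = -5, wraps to a huge value
    assert(in_range_single_compare(25, 20, 5));    // 25 - 20 + 5 = 10, 10 <= 10
    assert(!in_range_single_compare(30, 20, 5));   // 30 - 20 + 5 = 15, 15 > 10
}
```

The second case is the interesting one: the lower-bound failure shows up as an apparently too-large value, so one unsigned comparison covers both bounds.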

#include <stdio.h>

int main()
{
    int err = 0;

    for (int ia=0; ia<256; ia++)
        for (int ib=0; ib<256; ib++)
            for (int ic=0; ic<256; ic++)
            {
                unsigned char a = ia;
                unsigned char b = ib;
                unsigned char c = ic;
                int res1 = (a >= ( b - c > 0 ? b - c : 0 ) &&
                            a <= ( b + c < 255 ? b + c : 255 ));
                int res2 = (unsigned(a - b + c) <= 2*c);

                err += (res1 != res2);
            }
    printf("Errors = %i\n", err);
    return 0;
}

On x86 with g++ the assembler code generated for the res2 test only includes one conditional instruction.

The assembler code for the following loop is

void process(unsigned char *src, unsigned char *dst, int sz)
{
    for (int i=0; i<sz; i+=3)
    {
        unsigned char a = src[i];
        unsigned char b = src[i+1];
        unsigned char c = src[i+2];
        dst[i] = (unsigned(a - b + c) <= 2*c);
    }
}


.L3:
    movzbl  2(%ecx,%eax), %ebx    ; This loads c
    movzbl  (%ecx,%eax), %edx     ; This loads a
    movzbl  1(%ecx,%eax), %esi    ; This loads b
    leal    (%ebx,%edx), %edx     ; This computes a + c
    addl    %ebx, %ebx            ; This is c * 2
    subl    %esi, %edx            ; This is a - b + c
    cmpl    %ebx, %edx            ; Comparison
    setbe   (%edi,%eax)           ; Set 0/1 depending on result
    addl    $3, %eax              ; next group
    cmpl    %eax, 16(%ebp)        ; did we finish ?
    jg  .L3                   ; if not loop back for next

Using instead dst[i] = ((a<b ? b-a : a-b) <= c); the code becomes much longer

.L9:
    movzbl  %dl, %edx
    andl    $255, %esi
    subl    %esi, %edx
.L4:
    andl    $255, %edi
    cmpl    %edi, %edx
    movl    12(%ebp), %edx
    setle   (%edx,%eax)
    addl    $3, %eax
    cmpl    %eax, 16(%ebp)
    jle .L6
.L5:
    movzbl  (%ecx,%eax), %edx
    movb    %dl, -13(%ebp)
    movzbl  1(%ecx,%eax), %esi
    movzbl  2(%ecx,%eax), %edi
    movl    %esi, %ebx
    cmpb    %bl, %dl
    ja  .L9
    movl    %esi, %ebx
    movzbl  %bl, %edx
    movzbl  -13(%ebp), %ebx
    subl    %ebx, %edx
    jmp .L4
    .p2align 4,,7
    .p2align 3
.L6:

And I'm way too tired now to try to decipher it (2:28 AM here)

Anyway, longer doesn't necessarily mean slower (at first sight it seems g++ decided to unroll the loop by writing a few elements at a time in this case).

As I said before, you should do some actual profiling with your real computation and your real data. Note that if true performance is needed, the best strategy may well differ depending on the processor.

For example, during boot Linux runs a test to decide which is the fastest way to perform a certain computation that is needed in the kernel. There are just too many variables (cache size/levels, RAM speed, CPU clock, chipset, CPU type...).

Answer 4:

Rarely does embedding the ternary operator in another statement improve performance :)

If every single opcode matters, write the opcodes yourself: use assembler. Also consider using SIMD instructions if possible. I'd also be interested in the target platform. ARM assembler loves compares of this sort and has opcodes to speed up saturated math of this type.
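
As one possible illustration of the SIMD route, here is a sketch using SSE2 intrinsics on x86 that tests 16 byte triples per call via saturated subtraction. It assumes the a, b and c values live in three separate arrays, which is not how the question's data is necessarily laid out, and the function name within_threshold_16 is made up. A scalar fallback is included for builds where __SSE2__ is not defined.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// For 16 triples at once: out[i] = 0xFF if |a[i] - b[i]| <= c[i], else 0x00.
void within_threshold_16(const std::uint8_t* a, const std::uint8_t* b,
                         const std::uint8_t* c, std::uint8_t* out) {
    assert(a && b && c && out);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c));
    // Unsigned saturating subtraction clamps at 0, so one of the two
    // differences is 0 and the other is |a - b|; OR combines them.
    __m128i diff = _mm_or_si128(_mm_subs_epu8(va, vb), _mm_subs_epu8(vb, va));
    // diff <= c exactly when the saturating difference diff - c is 0.
    __m128i mask = _mm_cmpeq_epi8(_mm_subs_epu8(diff, vc), _mm_setzero_si128());
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), mask);
#else
    // Scalar fallback with the same results.
    for (int i = 0; i < 16; ++i) {
        int d = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
        out[i] = d <= c[i] ? 0xFF : 0x00;
    }
#endif
}
```

Note that the result is a 0xFF/0x00 mask per byte rather than 0/1. As with everything else in this thread, it would need profiling on the real data before assuming it is a win.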

Source: https://stackoverflow.com/questions/5924611/what-is-the-best-way-performance-wise-to-test-whether-a-value-falls-within-a-t

Tags: c++, performance, optimization, byte