Simple for() loop benchmark takes the same time with any loop bound

核能气质少年 提交于 2020-06-21 05:22:06

问题


I'm willing to write a code that makes my CPU execute some operations and see how much time does he take to solve them. I wanted to make a loop going from i=0 to i<5000 and then multiplying i by a constant number and time that. I've ended up with this code, it has no errors but it only takes 0.024 seconds to execute the code even if I change the loop i<49058349083 or if i<2 it takes the same amount of time. Whats the error?

PD: I started yesterday learning C++ i'm sorry if this is a really easy question to answer but I couldn't find the solution

#include <iostream>
#include <ctime>

using namespace std;

int main () {
    int start_s=clock();

    int i;
    for(i=0;i<5000;i++){
        i*434243;
    }

    int stop_s=clock();
    cout << "time: "<< (stop_s-start_s)/double(CLOCKS_PER_SEC)*1000;

    return 0;
}

回答1:


BTW, if you'd actually done i<49058349083, gcc and clang create an infinite loop on systems with 32-bit int (including x86 and x86-64). 49058349083 is greater than INT_MAX. Large literal numbers are implicitly promoted to a type large enough to hold them, so you effectively did (int64_t)i < 49058349083LL, which is true for any possible value of int i.

Signed overflow is Undefined Behaviour in C++, and so is an infinite loop with no side-effects inside the loop (e.g. a system call), so I checked on the Godbolt compiler explorer to see how it really compiles with optimization enabled. Fun fact: MSVC still optimizes away an empty loop (including one with i*10 not assigned to anything) when the condition is an always-true compare, rather than a non-zero constant like 42.


A loop like this is pretty fundamentally flawed.

You can microbenchmark a complete non-inline function using Google's benchmark package (how would you benchmark the performance of a function), but to learn anything useful by putting something inside a repeat-loop you have to know a lot about how your compiler compiles to asm, exactly what you're trying to measure, and how to get the optimizer to make asm that's similar to what you'd get from your code in some real use context. e.g. by using inline asm to require it to have a certain result in a register, or by assigning to a volatile variable (which also has the overhead of doing a store).

If this sound a lot more complicated than you were hoping, well too bad, it is, and for good reason.

That's because compilers are incredibly complex pieces of machinery that can usually produce fairly efficient executables out of source that's written to clearly express what's going on for human readers, not to avoid redundant work or look anything like an efficient machine-language implementation (which is what your CPU actually runs).


Microbenchmarking is hard - you can't get meaningful results unless you understand how your code compiles and what you're really measuring.

Optimizing compilers are designed to create an executable that produces the same results as your C++ source, but which runs as quickly as possible. Performance is not an observable result, so it's always legal to make a program more efficient. This is the "as-if" rule: What exactly is the "as-if" rule?

You want your compiler to not waste time and code-size computing results that aren't used. After the compiler inlines a function into the caller, it often turns out that some of the things it computes aren't used. It's normal for well-written C++ code to have lots of work that can be thrown away, including optimizing away temporary variables entirely; this is not a bad thing and a compiler that didn't do this would suck.

Remember that you're writing for the C++ abstract machine, but your compiler's job is to translate that into assembly language (or machine code) for your CPU. Assembly language is quite different from C++. (And modern higher-performance CPUs can also execute instructions out of order, while following their own "as-if" rule to preserve the illusion of the compiler-generated code running in program order. But CPUs can't discard work, only overlap it.)

You can't microbenchmark the int * int binary operator in C++ in the general case, even just for your own desktop (nevermind on other hardware / different compilers). Different uses in different contexts will compile to different code. Even if you could create some version of your loop which measured something useful, it wouldn't necessarily tell you much about how expensive foo = a * b is in another program. The other program might bottleneck on multiply latency instead of throughput, or combine that multiply with some other nearby operation on a or b, or any number of things.


Don't disable optimization

You might think it would be useful to just disable optimization (like gcc -O0 instead of gcc -O3). But the only way to do that also introduces anti-optimizations like storing every value back to memory after every C++ statement, and reloading variables from memory for the next statement. This lets you modify variable values while debugging a compiled program, or even jump to a new line in the same function, and still get the results you'd expect from looking at the C++ source.

Supporting that level of interference imposes a huge burden on the compiler. Store/reload (store-forwarding) has about 5 cycle latency on a typical modern x86. This means an anti-optimized loop can only run one iteration per ~6 clock cycles at best, vs. 1 cycle for a tight loop like looptop: dec eax / jnz looptop where the loop counter is in a register.

There's not much middle ground: compilers don't have options to make asm that "looks like" the C++ source, but keeping values in registers across statements. That wouldn't be useful or meaningful anyway, because that's not how they compile with full optimization enabled. (gcc -Og might be a little bit like this.)

Spending time modifying your C++ to make it run faster with -O0 is a total waste of time: C loop optimization help for final assignment.

Note: anti-optimized debug mode (-O0) is the default for most compilers. It's also "compile-fast" mode, so it's good for seeing if your code compiles / runs at all, but it's useless for benchmarking. The resulting compiler-generated asm runs the speed it does for reasons which depend on the hardware, but don't tell you much of anything about how fast optimized code will run. (e.g. the answer to Adding a redundant assignment speeds up code when compiled without optimization is some fairly subtle Intel Sandybridge-family store-forwarding latency behaviour, directly caused by the store/reloads and having the loop bottleneck on the loop counter being in memory. Note that the first version of the question was asking about why doing that made the C faster, which was rightly downvoted because benchmarking -O0 is silly. It only turned into an interesting question when I edited it into an x86 asm question which is interesting because of the larger-but-faster asm, not because it came from gcc -O0 with any particular source changes.)

And this is not even mentioning C++ standard library functions like std::sort or std::vector::push_back, which depend on the optimizer to inline lots of nested calls to tiny helper / wrapper functions.


How compilers optimize

(I'm going to show transformations of C++ code. Remember that the compiler is actually transforming an internal representation of your program's logic and then producing asm. You can think of the transformed C++ as being pseudo-code for asm, where i++ represents an x86 inc eax instruction or something. Most C/C++ statements can map to 1 or a couple machine instructions. So it's a useful way of describing the logic of what the actual compiler-generated asm might be doing without actually writing it in asm.)

A result which is never used doesn't have to be computed in the first place. A loop with no side effects can be removed completely. Or a loop that assigns to a global variable (an observable side effect) could be optimized to just doing the last assignment. e.g.

// int gsink;  is a global.  
// "sink" is the opposite of a source: something we dump results into.
for (int i=0 ; i<n ; i++) {
    gsink = i*10;
}

is equivalent to this code, as far as an optimizing compiler is concerned:

if (! (0 < n)) {     // you might not have noticed that your loop could run 0 times
    gsink = n*10;    // but the compiler isn't allowed to do gsink=0 if n<1
}

If gsink were instead a local or file-scoped static with nothing that reads it, the compiler could optimize it away entirely. But the compiler can't "see" code outside the current C++ source file ("compilation unit") while compiling a function containing that, so it can't change the observable side-effect that when the function returns, gsink = n*10;.

Nothing reads the intermediate values of gsink because there are no function calls to non-inline function. (Because it's not an atomic<int>, the compiler can assume that no other threads or signal handlers read it; that would be data-race Undefined Behaviour.)


Using volatile to get the compiler to do some work.

If it were global volatile int gsink, the actual store that places the value in memory would be an observable side-effect (that's what volatile means in C++). But does that mean we can benchmark multiplication this way? No, it doesn't. The side-effect the compiler has to preserve is only the placing of the final value in memory. If it can calculate it more cheaply than with i * 10 every time through the loop, it will do so.

This loop would also produce the same result sequence of assignments to gsink, and thus is a valid option for an optimizing compiler. Transforming independent multiplies to a loop-carried add is called a "strength-reduction" optimization.

volatile int gsink;

int i10 = 0;   // I could have packed this all into a for() loop
int i=0;       // but this is more readable
while (i<n) {
    gsink = i10;
    i10 += 10;
    i++;
}

Could the compiler drop i altogether and use i10 < n*10 as the loop condition? (Of course hoisting the upperbound = n*10 calculation out of the loop.)

That wouldn't always give the same behaviour, because n*10 can overflow, so the loop can run at most INT_MAX/10 times if implemented that way. But signed overflow in C++ is undefined behaviour, and the i*10 in the loop body would overflow in any program where n*10 overflowed, so the compiler can safely introduce an n*10 without changing the behaviour of any legal / well-defined program. See What Every C Programmer Should Know About Undefined Behavior.

(Actually, i*10 is only evaluated for i up to n-1 at most, and n*10 could overflow while (n-1)*10 doesn't. What gcc actually does is more like while(i10 != n*10) using unsigned math, when compiling for x86. x86 is a 2's complement machine, so unsigned and signed are the same binary operation, and checking for != instead of signed less-than is safe even if (unsigned)n*10UL == 0x8000000UL, which is INT_MIN.)

For more about looking at compiler asm output, and a total-beginner intro to x86 asm, see Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”. (And related: How to remove "noise" from GCC/clang assembly output?). See http://agner.org/optimize/ for more about how current x86 CPUs perform.

Compiler output for this function from gcc7.3 -O3, compiling for x86-64, on the Godbolt compiler explorer:

volatile int gvsink;

void store_n(int n) {
    for(int i=0 ; i<n; i++) {
        gvsink = i*10;
    }
}

store_n(int):          # n in EDI  (x86-64 System V calling convention)
    test    edi, edi
    jle     .L5                   # if(n<=0) goto end

    lea     edx, [rdi+rdi*4]      # edx = n * 5
    xor     eax, eax              # tmp = 0
    add     edx, edx              # edx = n * 10

.L7:                                   # do {
    mov     DWORD PTR gvsink[rip], eax   # gvsink = tmp
    add     eax, 10                      # tmp += 10
    cmp     eax, edx
    jne     .L7                        # } while(tmp != n*10)
.L5:
    rep ret

The optimal / idiomatic asm loop structure is a do{}while(), so compilers always try to make loops like this. (That doesn't mean you have to write your source that way, but you can let the compiler avoid the check for zero iterations in cases like this where it can't prove that.)

If we'd used unsigned int, overflow would be well-defined as wraparound, so there's no UB the compiler can use as an excuse to compile your code in a way you didn't expect. (Modern C++ is not a forgiving language. Optimizing compilers are quite hostile to programmers who don't carefully avoid any UB, and this can get pretty subtle because a lot of stuff is undefined behaviour. Compiling C++ for x86 is not like writing x86 assembly. But fortunately there are compiler options like gcc -fsanitize=undefined which will detect and warn about UB at runtime. You still have to check every possible input value you care about, though.)

void store_n(unsigned int n) {
    for(unsigned int i=0 ; i<n; i++) {
        gvsink = i*10;
    }
}

store_n(unsigned int):
    test    edi, edi
    je      .L9            # if (n==0) return;
    xor     edx, edx       # i10 = 0
    xor     eax, eax       # i = 0
.L11:                      # do{
    add     eax, 1         #    i++
    mov     DWORD PTR gvsink[rip], edx
    add     edx, 10        #    i10 += 10
    cmp     edi, eax       
    jne     .L11           # } while(i!=n)
.L9:
    rep ret

Clang compiles with two separate counters for signed and unsigned. Clang's code is more like

i10 = 0;
do {
    gvsink = i10;
    i10 += 10;
} while(--n != 0);

So it counts down the n register toward zero, avoiding a separate cmp instruction because add/sub instructions also set the condition-code flags which the CPU can branch on. (Clang unrolls small loops by default, generating i10, i10 + 10, i10 + 20, up to i10 + 70 in registers it can store from, while only running the loop-overhead instructions once. There's not a lot to be gained here from unrolling on typical modern CPUs, though. One store per clock cycle is a bottleneck, and so is 4 uops per clock (on Intel CPUs) issuing from the front-end into the out-of-order part of the core.


Getting the compiler to multiply:

How do we stop that strength-reduction optimization? Replacing *10 with * variable doesn't work, then we just get code that adds are register instead of adding an immediate constant.

We could turn it into a loop over arrays like a[i] = b[i] * 10;, but then we'd be dependent on memory bandwidth as well. Also, that could auto-vectorize with SIMD instructions, which we might or might not want to be testing.

We could do something like tmp *= i; (with unsigned, to avoid signed-overflow UB). But that makes the output of the multiply in each iteration an input for the next. So we'd be benchmarking multiply latency, not throughput. (CPUs are pipelined, and for example can start a new multiply every clock cycle, but the result of a single multiply isn't ready until 3 clock cycles later. So you'd need at least tmp1*=i, tmp2*=i, and tmp3*=i to keep the integer multiply unit on an Intel Sandybridge-family CPU saturated with work.

This comes back to having to know exactly what you're measuring to make a meaningful microbenchmark at this level of detail.

If this answer is going way over your head, stick to timing whole functions! It really isn't possible to say much about a single C arithmetic operator or an expression without understanding the surrounding context, and how it works in asm.

If you understand caching, you can understand memory access and arrays vs. linked lists pretty decently without getting into too much asm-level detail. It's possible to understand C++ performance in some level of detail without knowing much about asm (beyond the fact that it exists, and that compilers optimize heavily). But the more you know about asm, CPU performance tuning, and how compilers work, the more things start to make sense.


PS:

Any computation based on compile-time-constant values can (and hopefully is) done at compile time. This is called "constant propagation". Hiding constants from the optimizer (e.g. by inputting them as command line args (atoi(argv[1]), or with other tricks) can make the compiler-generated code for a microbenchmark look more like the real use-case, if that use-case also can't see constants at compile time. (But note that constants defined in other files become visible with link-time optimization, which is very good for projects with lots of small functions that call each other across source file boundaries and aren't defined in headers (.h) where they can inline normally.)

Multiplying by 16 (or any other power of 2) will use a shift, because that's more efficient than an actual multiply instruction. This is a big deal for division, especially. See Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture?, and Why does GCC use multiplication by a strange number in implementing integer division?.

Other multiply constants with only a couple bits set in their binary representation can be done with some shift+add, often with lower latency than a general-purpose multiply instruction. See for example How to multiply a register by 37 using only 2 consecutive leal instructions in x86?.

None of these optimizations are possible with a * b or a / b if neither input is known at compile time.


See also: How can I benchmark the performance of C++ code?, especially the links to Chandler Carruth's CppCon 2015 talk: "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!".

And because it's worth mentioning twice: Matt Godbolt's CppCon2017 talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”. It's a gentle enough introduction that a beginner can probably follow it well enough to look at how their loop compiled, and see if it optimized away or not.




回答2:


Because of the body of the for loop:

i*434243;

which does nothing, so assuming you are compiling the code with optimization flags enabled, the compiler wipes that out.

Changing it to:

int a = i*434243;

would be likely be optimized out on anything else than -O0, so I wouldn't suggest it.

Moreover, this will lead to Undefined Behavior, because overflowing, since the constant value you are using is relatively big, as i continues to increment.

I suggest you do instead:

int a = i * i;
cout << a << "\n";   



回答3:


The code you write is not the instructions your CPU is executing. In a nutshell: The compiler translates your code into machine instructions that can be anything as long as the result is the same as if it would perform the exact steps as you wrote down in your code (Colloquially known as the "as-if" rule). Consider this example:

int foo() {
    int x = 0;
    for (int i=0;i<1000;i++){
        x += i;
    }
    return 42;
}

int bar() {
    return 42;
}

The two function look quite different, however, the compiler will likely create the exact same machine code for them, because you have no way to decide whether the extra loop was executed or not (burning CPU power and taking time does not count for the as-if rule).

How aggresively the compiler optimizes your code is controlled by the -O flag. Typically -O0 for debug builds (because it compiles faster) and -O2 or -O3 for release builds.

Timing code can be tricky because you have to make sure that you are actually measuring something. For foo you could make sure that the loop is executed (*) by writing this:

int foo() {
    int x = 0;
    for (int i=0;i<1000;i++){
        x += i;
    }
    return x;
}

(*) = Even this wont result in running the loop on most compilers, as this kind of loop is such a common pattern that it is detected and results in something along the line of x = 1000*1001/2;.




回答4:


It is often very difficult to get the compiler to keep the code you are interested in profiling.

I highly recommend profiling the actual code that does something useful as there are so many pitfalls in predicting the timing of actual code based on a number of separate measurements.

If you wish to continue, one option is to declare a variable as volatile and assign your answer to it.

volatile int a = i * 434243;

Another is to create a function and return the value. You may need to disable in-lining.

You are very unlikely to answer a question like "how long does a multiply take?". You are always answering questions like, "how long does it take to do a multiply and give me the answer?".

You typically need to "keep temps" and inspect the assembler to make sure it is doing what you expect. If you are calling a function then you may want to compare calling a function that does nothing to try and eliminate calling overhead from your timing.




回答5:


I slightly altered your code to be the following:

#include <iostream>
#include <ctime>

using namespace std;

int main() {
    for (long long int j = 10000000; j <= 10000000000; j = j * 10) {

        int start_s = clock();

        for (long long int i = 0; i < j; i++) {
            i * 434243;
        }

        int stop_s = clock();

        cout << "time: " << (stop_s - start_s) / double(CLOCKS_PER_SEC) * 1000 << endl;
    }

    int k;
    cin >> k;
    return 0;
}

Output: (Which looks perfectly fine to me.)

time: 23
time: 224
time: 2497
time: 21697

There is one thing to be aware of. Since i is an integer, it never will be equal to 49058349083 anyways. In your case upper bound is converted to int which corresponds to anything between -2,147,483,648 and 2,147,483,647 so loop runs anything between 0 and 2,147,483,647 times which is not that big of a number for a simple multiplication operation. (1813708827 in case of 49058349083).

Try using long long int which can be between -2^63 and 2^63-1



来源:https://stackoverflow.com/questions/50924929/simple-for-loop-benchmark-takes-the-same-time-with-any-loop-bound

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!