Mysteries of C++ optimization

Submitted by 倖福魔咒の on 2020-01-14 10:13:49

Question


Take the two following snippets:

#include <iostream>
#include <sys/time.h>

using namespace std;

//Stand-in for the asker's unshown utime() helper; assumed to return
//wall-clock time in microseconds.
unsigned long int utime()
{
    timeval tv;
    gettimeofday(&tv, nullptr);
    return tv.tv_sec * 1000000UL + tv.tv_usec;
}

int main()
{
    unsigned long int start = utime();

    __int128_t n = 128;

    for(__int128_t i=1; i<1000000000; i++)
        n = (n * i);

    unsigned long int end = utime();

    cout<<(unsigned long int) n<<endl;

    cout<<end - start<<endl;
}

and

int main()  //same includes and utime() stand-in as in the first snippet
{
    unsigned long int start = utime();

    __int128_t n = 128;

    for(__int128_t i=1; i<1000000000; i++)
        n = (n * i) >> 2;

    unsigned long int end = utime();

    cout<<(unsigned long int) n<<endl;

    cout<<end - start<<endl;
}

I am benchmarking 128-bit integers in C++. When executing the first one (just the multiplication), everything runs in approximately 0.95 seconds. When I also add the bit-shift operation (second snippet), the execution time rises to an astounding 2.49 seconds.

How is this possible? I thought that bit shifting was one of the lightest operations for a processor. How come there is so much overhead due to such a simple operation? I am compiling with the -O3 flag activated.

Any idea?


Answer 1:


This question has been bugging me for the past few days, so I decided to do some more investigation. My initial answer focused on the difference in data values between the two tests. My assertion was that the integer multiplication unit in the processor finishes an operation in fewer clock cycles if one of the operands is zero.

While there are instructions that are clearly documented to work that way (integer division, for example), there are very strong indications that integer multiplication is done in a constant number of cycles in modern processors, regardless of input. The note in Intel's documentation that initially made me think that the number of cycles for integer multiplication can depend on input data doesn't seem to apply to these instructions. Also, I did some more rigorous performance tests with the same sequence of instructions on both zero and non-zero operands and the results didn't yield significant differences. As far as I can tell, harold's comment on this subject is correct. My mistake; sorry.

While contemplating the possibility of deleting this answer altogether, so that it doesn't lead people astray in the future, I realized there were still quite a few more things worth saying on this subject. I also think there's at least one other way in which data values can influence performance in such calculations (included in the last section). So, I decided to restructure and enhance the rest of the information in my initial answer, started writing, and... didn't quite stop for a while. It's up to you to decide whether it was worth it.

The information is structured into the following sections:

  • What the code does
  • What the compiler does
  • What the processor does
  • What you can do about it
  • Unanswered questions

What the code does

It overflows, mostly.

In the first version, n starts overflowing on the 33rd iteration. In the second version, with the shift, n starts overflowing on the 52nd iteration.

In the version without the shift, starting with the 128th iteration, n is zero (it overflows "cleanly", leaving only zeros in the least significant 128 bits of the result).
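
A quick way to convince yourself of this is a minimal check (a sketch, using __uint128_t rather than the signed type so the wraparound is well-defined; the bound of 150 is arbitrary):

#include <iostream>

int main()
{
    __uint128_t n = 128;
    for (int i = 1; i <= 150; ++i)
    {
        n *= i;
        if (n == 0)
        {
            std::cout << "n becomes zero on iteration " << i << std::endl; //prints 128
            break;
        }
    }
}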

In the second version, the right shift (dividing by 4) takes out more factors of two from the value of n on each iteration than the new operands bring in, so the shift results in rounding on some iterations. Quick calculation: the total number of factors of two in all numbers from 1 to 128 is equal to

128/2 + 128/4 + ... + 2 + 1 = 2^6 + 2^5 + ... + 2 + 1 = 2^7 - 1 = 127

while the number of factors of two taken out by the right shift (if it had enough to take from) is two per iteration, or 128 * 2 = 256 over those iterations, more than double that.

Armed with this knowledge, we can give a first answer: from the point of view of the C++ standard, this code spends most of its time in undefined behaviour land, so all bets are off. Problem solved; stop reading now.

What the compiler does

If you're still reading, from this point forward we'll ignore the overflows and look at some implementation details. "The compiler", in this case, means GCC 4.9.2 or Clang 3.5.1. I've only done performance measurements on code generated by GCC. For Clang, I've looked at the generated code for a few test cases and noted some differences that I'll mention below, but I haven't actually run the code; I might have missed some things.

Both multiplication and shift operations are only available for 64-bit operands, so 128-bit operations need to be implemented in terms of those. First, multiplication: n can be written as 2^64 * nh + nl, where nh and nl are the high and low 64-bit halves, respectively. The same goes for i. So, the multiplication can be written:

(2^64 * nh + nl) * (2^64 * ih + il) = 2^128 * nh * ih + 2^64 * (nh * il + nl * ih) + nl * il

The first term doesn't have any non-zero bits in the lower 128-bit part; it's either all overflow or all zero. Since ignoring integer overflows is valid and common for C++ implementations, that's what the compiler does: the first term is ignored completely.

The parenthesized sum only contributes bits to the upper 64-bit half of the 128-bit result; any overflow resulting from its two multiplications or the addition is also ignored (the result is truncated to 64 bits).

The last term determines the bits in the low 64-bit half of the result and, if the result of that multiplication has more than 64 bits, the extra bits need to be added to the high 64-bit half obtained from the parenthesis discussed before. There's a very useful multiplication instruction in x86-64 assembly that does just what's needed: takes two 64-bit operands and places the result in two 64-bit registers, so the high half is ready to be added to the result of the operations in the parenthesis.

That is how 128-bit integer multiplication is implemented: three 64-bit multiplications and two 64-bit additions.
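
As a rough C++ illustration of that decomposition (a sketch; mul128 is a hypothetical helper, and the widening 64x64 multiplication is expressed by doing the low product at 128-bit width):

#include <cstdint>

__uint128_t mul128(__uint128_t n, __uint128_t i)
{
   //Unpack the four 64-bit halves.
   std::uint64_t nh = static_cast<std::uint64_t>(n >> 64);
   std::uint64_t nl = static_cast<std::uint64_t>(n);
   std::uint64_t ih = static_cast<std::uint64_t>(i >> 64);
   std::uint64_t il = static_cast<std::uint64_t>(i);

   //Two multiplications that only feed the high half; overflow is ignored.
   std::uint64_t cross = nh * il + nl * ih;
   //The third multiplication, widening 64x64 -> 128 (the MUL instruction).
   __uint128_t low = static_cast<__uint128_t>(nl) * il;

   std::uint64_t rh = cross + static_cast<std::uint64_t>(low >> 64);
   std::uint64_t rl = static_cast<std::uint64_t>(low);
   return static_cast<__uint128_t>(rh) << 64 | rl;
}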

Now, the shift: using the same notations as above, the two least significant bits of nh need to become the two most significant bits of nl, after the contents of the latter are shifted right by two bits. Using C++ syntax, it would look like this:

nl = nh << 62 | nl >> 2 //Doesn't change nh, only uses its bits.

Besides that, nh also needs to be shifted, using something like

nh >>= 2;

That is how the compiler implements a 128-bit shift. For the first part, there's an x86-64 instruction that has the exact semantics of that expression; it's called SHRD. Using it can be good or bad, as we'll see below, and the two compilers make slightly different choices in this respect.
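
For reference, that two-instruction sequence is exactly what shows up in the cmp2 listing of the second answer below:

shrd    rax, rdx, 2    ; nl = nh << 62 | nl >> 2
sar     rdx, 2         ; nh >>= 2 (arithmetic shift, since n is signed)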

What the processor does

... is highly processor-dependent. (No... really?!)

Detailed information about what happens for Haswell processors can be found in harold's excellent answer. Here, I'll try to cover more ground at a higher level. For more detailed data, here are some sources:

  • Intel® 64 and IA-32 Architectures Optimization Reference Manual
  • Software Optimization Guide for AMD Family 15h Processors
  • Agner Fog's microarchitecture manual and instruction tables.

I'll refer to the following architectures:

  • Intel Sandy Bridge / Ivy Bridge - abbreviated "IntelSB" going forward;
  • Intel Haswell / Broadwell - "IntelH" going forward;
  • I'll just use "Intel" for things that are the same between SB and H.
  • AMD Bulldozer / Piledriver / Steamroller - "AMD" going forward.

I have measurement data taken on an IntelSB system; I think it's precise enough, as long as the compiler doesn't act up. Unfortunately, when working with such tight loops, this can happen very easily. At various points during testing, I had to use all kinds of stupid tricks to avoid GCC's idiosyncrasies, usually related to register use. For example, it has a tendency to shuffle registers around unnecessarily when compiling simpler code, compared with other cases where it generates optimal assembly. Ironically, on my test setup, it tended to generate optimal code for the second sample, using the shift, and worse code for the first one, making the impact of the shift less visible. Clang/LLVM seems to have fewer of those bad habits, but then again, I looked at fewer examples using it and I didn't measure any of them, so this doesn't mean much. In the interest of comparing apples with apples, all measurement data below refers to the best code generated for each case.

First, let's rearrange the expression for 128-bit multiplication from the previous section into a (horrible) diagram:

nh * il
        \
         +  -> tmp
        /          \
nl * ih             + -> next nh
                   /
             high 64 bits
                 /
nl * il --------
         \
      low 64 bits 
           \
             -> next nl

(sorry, I hope it gets the point across)

Some important points:

  • The two additions can't execute until their respective inputs are ready; the final addition can't execute until everything else is ready.
  • The three multiplications can, theoretically, execute in parallel (no input depends on another multiplication's output).
  • In the ideal scenario above, the total number of cycles to complete the entire calculation for one iteration is the sum of the number of cycles for one multiplication and two additions.
  • The next nl can be ready early. This, together with the fact that the next il and ih are very cheap to calculate, means the nl * il and nl * ih calculations for the next iteration can start early, possibly before the next nh has been computed.

Multiplications can't really execute entirely in parallel on these processors, as there's only one integer multiplication unit for each core, but they can execute concurrently through pipelining. One multiplication can begin executing each cycle on Intel, and one every 4 cycles on AMD, even if previous multiplications haven't finished executing yet.

All of the above mean that, if the loop's body doesn't contain anything else that gets in the way, the processor can reorder those multiplications to achieve something as close as possible to the ideal scenario above. This applies to the first code snippet. On IntelH, as measured by harold, it's exactly the ideal scenario: 5 cycles per iteration are made up of 3 cycles for one multiplication and one cycle each for the two additions (impressive, to be honest). On IntelSB, I measured 6 cycles per iteration (closer to 5.5, actually).

The problem is that in the second code snippet something does get in the way:

nh * il
        \                              normal shift -> next nh
         +  -> tmp                   /
        /          \                /
nl * ih             + ----> temp nh
                   /                \
             high 64 bits            \
                 /                     "composite" shift -> next nl
nl * il --------                     /
         \                          /
      low 64 bits                  /
           \                      /
             -> temp nl ---------

The next nl is no longer ready early. temp nl has to wait for temp nh to be ready, so that both can be fed into the composite shift, and only then will we have the next nl. Even if both shifts are very fast and execute in parallel, they don't just add the execution cost of one shift to an iteration; they also change the dynamics of the loop's "pipeline", acting like a sort of synchronizing barrier.

If the two shifts finish at the same time, then all three multiplications for the next iteration will be ready to execute at the same time, and they can't all start in parallel, as explained above; they'll have to wait for one another, wasting cycles. This is the case on IntelSB, where the two shifts are equally fast (see below); I measured 8 cycles per iteration for this case.

If the two shifts don't finish at the same time, it will typically be the normal shift that finishes first (the composite shift is slower on most architectures). This means that the next nh will be ready early, so the top multiplication can start early for the next iteration. However, the other two multiplications still have to wait more (wasted) cycles for the composite shift to finish, and then they'll be ready at the same time and one will have to wait for the other to start, wasting some more time. This is the case on IntelH, measured by harold at 9 cycles per iteration.

I expect AMD to fall under this last category as well. While there's an even bigger difference in performance between the composite shift and normal shift on this platform, integer multiplications are also slower on AMD than on Intel (more than twice as slow), making the first sample slower to begin with. As a very rough estimate, I think the first version could take about 12 cycles on AMD, with the second one at around 16. It would be nice to have some concrete measurements, though.

Some more data on the troublesome composite shift, SHRD:

  • On IntelSB, it's exactly as cheap as a simple shift (great!); simple shifts are about as cheap as they come: they execute in one cycle, and two shifts can start executing each cycle.
  • On IntelH, SHRD takes 3 cycles to execute (yes, it got worse in the newer generation), and two shifts of any kind (simple or composite) can start executing each cycle.
  • On AMD, it's even worse. If I'm reading the data correctly, executing an SHRD keeps both shift execution units busy until execution finishes - no parallelism and no pipelining possible; it takes 3 cycles, during which no other shift can start executing.

What you can do about it

I can think of three possible improvements:

  1. replace SHRD with something faster on platforms where it makes sense;
  2. optimize the multiplication to take advantage of the data types involved here;
  3. restructure the loop.

1. SHRD can be replaced with two shifts and a bitwise OR, as described in the compiler section. A C++ implementation of a 128-bit shift right by two bits could look like this:

#include <cstdint>

__int128_t shr2(__int128_t n)
{
   using std::int64_t;
   using std::uint64_t;

   //Unpack the two halves.
   int64_t nh = n >> 64;
   uint64_t nl = static_cast<uint64_t>(n);

   //Do the actual work. (Casting nh before the left shift avoids
   //left-shifting a negative value, which is undefined behaviour.)
   uint64_t rl = nl >> 2 | static_cast<uint64_t>(nh) << 62;
   int64_t rh = nh >> 2;

   //Pack the result.
   return static_cast<__int128_t>(rh) << 64 | rl;
}

Although it looks like a lot of code, only the middle section doing the actual work generates shifts and ORs. The other parts merely indicate to the compiler which 64-bit parts we want to work with; since the 64-bit parts are already in separate registers, those are effectively no-ops in the generated assembly code.

However, keep in mind that this amounts to "trying to write assembly using C++ syntax", and it's generally not a very bright idea. I'm only using it because I verified that it works for GCC and I'm trying to keep the amount of assembly code in this answer to a minimum. Even so, there's one surprise: the LLVM optimizer detects what we're trying to do with those two shifts and one OR and... replaces them with an SHRD in some cases (more about this below).

Functions of the same form can be used for shifts by other numbers of bits, less than 64. From 64 to 127, it gets easier, but the form changes. One thing to keep in mind is that it would be a mistake to pass the number of bits for shifting as a runtime parameter to a shr function. Shift instructions by a variable number of bits are slower than the ones using a constant number on most architectures. You could use a non-type template parameter to generate different functions at compile time - this is C++, after all...
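
Here's a minimal sketch of that idea (hedged: shr is a hypothetical name, and only the below-64 form of shr2 is generalized; the 64-to-127 case would need the other form):

#include <cstdint>

template <int K>
__int128_t shr(__int128_t n)
{
   static_assert(K > 0 && K < 64, "this form only handles shift counts below 64");

   using std::int64_t;
   using std::uint64_t;

   int64_t nh = n >> 64;
   uint64_t nl = static_cast<uint64_t>(n);

   uint64_t rl = nl >> K | static_cast<uint64_t>(nh) << (64 - K);
   int64_t rh = nh >> K;

   return static_cast<__int128_t>(rh) << 64 | rl;
}

//Usage: n = shr<2>(n); - each K instantiates constant-count shifts.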

I think using such a function makes sense on all architectures except IntelSB, where SHRD is already as fast as it can get. On AMD, it will definitely be an improvement. Less so on IntelH: for our case, I don't think it will make a difference, but generally it could shave one cycle off some calculations; there could theoretically be cases where it makes things slightly worse, but I think those are very uncommon (as usual, there's no substitute for measuring). I don't think it will make a difference for our loop because it will change things from [nh being ready after one cycle and nl after three] to [both being ready after two]; this means all three multiplications for the next iteration will be ready at the same time and they'll have to wait for one another, essentially wasting the cycle that was gained by the shift.

GCC seems to use SHRD on all architectures, and the "assembly in C++" code above can be used as an optimization where it makes sense. The LLVM optimizer uses a more nuanced approach: it does the optimization (replaces SHRD) automatically for AMD, but not for Intel, where it even reverses it, as mentioned above. This may change in future releases, as indicated by the discussion on the patch for LLVM that implemented this optimization. For now, if you want to use the alternative with LLVM on Intel, you'll have to resort to assembly code.

2. Optimizing the multiplication: The test code uses a 128-bit integer for i, but that's not needed in this case, as its value fits easily in 64 bits (32, actually, but that doesn't help us here). What this means is that ih will always be zero; this reduces the diagram for 128-bit multiplication to the following:

nh * il
        \
         \
          \
           + -> next nh
          /
    high 64 bits
        /
nl * il 
        \
     low 64 bits 
          \
            -> next nl

Normally, I'd just say "declare i as long long and let the compiler optimize things", but unfortunately this doesn't work here; both compilers follow the standard behaviour of converting the two operands to their common type before doing the calculation, so i ends up as 128 bits even if it starts out as 64. We'll have to do things the hard way:

#include <cstdint>

__int128_t mul(__int128_t n, long long i)
{
   using std::int64_t;
   using std::uint64_t;

   //Unpack the two halves.
   int64_t nh = n >> 64;
   uint64_t nl = static_cast<uint64_t>(n);

   //Do the actual work.
   __asm__(R"(
    movq    %0, %%r10
    imulq   %2, %%r10
    mulq    %2
    addq    %%r10, %0
   )" : "+d"(nh), "+a"(nl) : "r"(i) : "%r10");

   //Pack the result.
   return static_cast<__int128_t>(nh) << 64 | nl;
}

I said I tried to avoid assembly code in this answer, but it's not always possible. I managed to coax GCC into generating the right code with "assembly in C++" for the function above, but once the function is inlined everything falls apart - the optimizer sees what's going on in the complete loop body and converts everything to 128 bits. LLVM seems to behave in this case, but, since I was testing on GCC, I had to use a reliable way to get the right code in there.

Declaring i as long long and using this function instead of the normal multiplication operator, I measured 5 cycles per iteration for the first sample and 7 cycles for the second one on IntelSB, a gain of one cycle in each case. I expect it to shave one cycle off the iterations for both examples on IntelH as well.
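
For clarity, the intended use looks like this (a sketch of the first sample; for the second one, the shift would follow the call):

__int128_t n = 128;
for (long long i = 1; i < 1000000000; ++i) //i stays on 64 bits
    n = mul(n, i);                         //instead of n = n * i;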

3. The loop can sometimes be restructured to encourage pipelined execution, when (at least some) iterations don't really depend on previous results, even though it may look like they do. For example, we could replace the for loop for the second sample with something like this:

__int128_t n2 = 1;
long long j = 1000000000 / 2;
for(long long i = 1; i < 1000000000 / 2; ++i, ++j)
{
   n *= i;
   n2 *= j;
   n >>= 2;
   n2 >>= 2; 
}
n *= (n2 * j) >> 2;

This takes advantage of the fact that some partial results can be calculated independently and only aggregated at the end. We're also hinting to the compiler that we want to pipeline the multiplications and shifts (not always necessary, but it does make a small difference for GCC for this code).

The code above is nothing more than a naive proof of concept. Real code would need to handle the total number of iterations in a more reliable way. The bigger problem is that this code won't generate the same results as the original, because of different behaviour in the presence of overflow and rounding. Even if we stop the loop on the 51st iteration, to avoid overflow, the result will still be different by about 10%, because of rounding happening in different ways when shifting right. In real code, this would most likely be a problem; but then again, you wouldn't have real code like this, would you?

Assuming this technique is applied to a case where the problems above don't occur, I measured the performance of such code in a few cases, again on IntelSB. The results are given in "cycles per iteration", as before, where "iteration" means the one from the original code (I divided the total number of cycles for executing the whole loop by the total number of iterations executed by the original code, not for the restructured one, to have a meaningful comparison):

  • The code above executes in 7 cycles per iteration, a gain of one cycle over the original.
  • The code above with the multiplication operator replaced with our mul() function needs 6 cycles per iteration.

The restructured code does suffer from more register shuffling, which can't be avoided unfortunately (more variables). More recent processors like IntelH have architecture improvements that make register moves essentially free in many cases; this could make the code yield even larger gains. Using new instructions like MULX for IntelH could avoid some register moves altogether; GCC does use such instructions when compiling with -march=haswell.
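
For example (invocation sketch; the source file name is made up):

g++ -O3 -march=haswell bench.cpp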

Unanswered questions

None of the measurements that we have so far explain the large differences in performance reported by the OP, and observed by me on a different system.

My initial timings were taken on a remote system (Westmere family processor) where, of course, a lot of things could happen; yet, the results were strangely stable.

On that system, I also experimented with executing the second sample with a right shift and a left shift; the code using a right shift was consistently 50% slower than the other variant. I couldn't replicate that on my controlled test system on IntelSB, and I don't have an explanation for those results either.

We can discard all of the above as unpredictable side effects of compiler / processor / overall system behaviour, but I can't shake the feeling that not everything has been explained here.

It's true that it doesn't really make much sense to benchmark such tight loops without a controlled system, precise tools (counting cycles) and looking at the generated assembly code for each case. Compiler idiosyncrasies can easily result in code that artificially introduces differences of 50% or more in performance.

Another factor that could explain large differences is the presence of Intel Hyper-Threading. Different parts of the core behave differently when this is enabled, and the behaviour has also changed between processor families. This could have strange effects on tight loops.

To top everything off, here's a crazy hypothesis: flipping bits consumes more power than keeping them constant. In our case, the first sample, working with zero values most of the time, will be flipping far fewer bits than the second one, so the latter will consume more power. Many modern processors have features that dynamically adjust the core frequency depending on electrical and thermal limits (Intel Turbo Boost / AMD Turbo Core). This means that, theoretically, under the right (or wrong?) conditions, the second sample could trigger a reduction of the core frequency, thus making the same number of cycles take longer, and making the performance data-dependent.




Answer 2:


After benchmarking both (using the assembly generated by GCC 4.7.3 with -O2) on my 4770K, I found that the first one takes 5 cycles per iteration and the second one takes 9 cycles per iteration. Why such a big difference?

It turns out to be an interplay between throughput and latency. The main killer is shrd, which takes 3 cycles and is on the critical path. Here's a picture of it (I ignore the chain for i because it is faster and there is plenty of spare throughput for it to just run ahead; it will not interfere):

[Figure: the loop body's dependency chain. The edges are dependencies, not dataflow.]

Based solely on the latencies in this chain, the expected time would be 8 cycles per iteration. But it is not. The problem here is that for 8 cycles to happen, mul2 and imul3 have to execute in parallel, and integer multiplication only has a throughput of one per cycle. So one of them has to wait a cycle, and holds up the chain by a cycle. I verified this by changing that imul to an add, which reduced the time to 8 cycles per iteration. Changing the other imul to an add had no effect, as predicted by this explanation (it doesn't depend on shrd and can thus be scheduled earlier, without interfering with the other multiplications).

These exact details are only for Haswell.

The code I used was this:

section .text

global cmp1
proc_frame cmp1
[endprolog]
    mov r8, rsi
    mov r9, rdi
    mov esi, 1
    xor edi, edi
    mov eax, 128
    xor edx, edx
.L2:
    mov rcx, rdx
    mov rdx, rdi
    imul    rdx, rax
    imul    rcx, rsi
    add rcx, rdx
    mul rsi
    add rdx, rcx
    add rsi, 1
    mov rcx, rsi
    adc rdi, 0
    xor rcx, 10000000
    or  rcx, rdi
    jne .L2
    mov rdi, r9
    mov rsi, r8
    ret
endproc_frame

global cmp2
proc_frame cmp2
[endprolog]
    mov r8, rsi
    mov r9, rdi
    mov esi, 1
    xor edi, edi
    mov eax, 128
    xor edx, edx
.L3:
    mov rcx, rdi
    imul    rcx, rax
    imul    rdx, rsi
    add rcx, rdx
    mul rsi
    add rdx, rcx
    shrd    rax, rdx, 2
    sar rdx, 2
    add rsi, 1
    mov rcx, rsi
    adc rdi, 0
    xor rcx, 10000000
    or  rcx, rdi
    jne .L3
    mov rdi, r9
    mov rsi, r8
    ret
endproc_frame



Answer 3:


Unless your processor can support native 128-bit operations, the operations have to be implemented in software using the next-best option.

Your 128-bit operations are using the same scheme that 8-bit processors used for 16-bit operations, and that takes time.

For example, a 128-bit right shift, by one bit, using 64-bit registers requires two steps (see the sketch below):

  • Shift the most significant register right; the carry flag receives the bit that was shifted out.
  • Shift the least significant register right, with carry; the bits are shifted right, with the carry flag being shifted into the most significant bit position.
Without support for native 128-bit operations, your code will take twice as many operations as the same 64-bit operations, and sometimes more (multiplication, for example). This is why you are seeing such poor performance.

I highly recommend using 128-bit integers only in places where they are strictly necessary.



Source: https://stackoverflow.com/questions/28903408/mysteries-of-c-optimization
