问题
I wrote a piece of C code to show a point in a discussion about optimizations and branch prediction. Then I noticed even more diverse outcome than I did expect. My goal was to write it in a language that is common subset between C++ and C, that is standard-compliant for both languages and that is fairly portable. It was tested on different Windows PCs:
#include <stdio.h>
#include <time.h>
/// @return - time difference between start and stop in milliseconds
int ms_elapsed( clock_t start, clock_t stop )
{
return (int)( 1000.0 * ( stop - start ) / CLOCKS_PER_SEC );
}
int const Billion = 1000000000;
/// & with numbers up to Billion gives 0, 0, 2, 2 repeating pattern
int const Pattern_0_0_2_2 = 0x40000002;
/// @return - half of Billion
int unpredictableIfs()
{
int sum = 0;
for ( int i = 0; i < Billion; ++i )
{
// true, true, false, false ...
if ( ( i & Pattern_0_0_2_2 ) == 0 )
{
++sum;
}
}
return sum;
}
/// @return - half of Billion
int noIfs()
{
int sum = 0;
for ( int i = 0; i < Billion; ++i )
{
// 1, 1, 0, 0 ...
sum += ( i & Pattern_0_0_2_2 ) == 0;
}
return sum;
}
int main()
{
clock_t volatile start;
clock_t volatile stop;
int volatile sum;
printf( "Puzzling measurements:\n" );
start = clock();
sum = unpredictableIfs();
stop = clock();
printf( "Unpredictable ifs took %d msec; answer was %d\n"
, ms_elapsed(start, stop), sum );
start = clock();
sum = unpredictableIfs();
stop = clock();
printf( "Unpredictable ifs took %d msec; answer was %d\n"
, ms_elapsed(start, stop), sum );
start = clock();
sum = noIfs();
stop = clock();
printf( "Same without ifs took %d msec; answer was %d\n"
, ms_elapsed(start, stop), sum );
start = clock();
sum = unpredictableIfs();
stop = clock();
printf( "Unpredictable ifs took %d msec; answer was %d\n"
, ms_elapsed(start, stop), sum );
}
Compiled with VS2010; /O2 optimizations Intel Core 2, WinXP results:
Puzzling measurements:
Unpredictable ifs took 1344 msec; answer was 500000000
Unpredictable ifs took 1016 msec; answer was 500000000
Same without ifs took 1031 msec; answer was 500000000
Unpredictable ifs took 4797 msec; answer was 500000000
Edit: Full switches of compiler:
/Zi /nologo /W3 /WX- /O2 /Oi /Oy- /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\Trying.pch" /Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue
Other person posted such ... Compiled with MinGW, g++ 4.71, -O1 optimizations Intel Core 2, WinXP results:
Puzzling measurements:
Unpredictable ifs took 1656 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000
Same without ifs took 1969 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000
Also he posted such results for -O3 optimizations:
Puzzling measurements:
Unpredictable ifs took 1890 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000
Same without ifs took 1422 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000
Now I have question. What is going on here?
More specifically ... How can a fixed function take so different amounts of time? Is there something wrong in my code? Is there something tricky with Intel processor? Are the compilers doing something odd? Can it be because of 32 bit code ran on 64 bit processor?
Thanks for attention!
Edit: I accept that g++ -O1 just reuses returned values in 2 other calls. I also accept that g++ -O2 and g++ -O3 have defect that leaves the optimization out. Significant diversity of measured speeds (450% !!!) seems still mysterious.
I looked at disassembly of code produced by VS2010. It did inline unpredictableIfs 3 times. The inlined code was fairly similar; the loop was same. It did not inline noIfs. It did roll noIfs out a bit. It takes 4 steps in one iteration. noIfs calculate like was written while unpredictableIfs use jne to jump over increment.
回答1:
With -O1, gcc-4.7.1 calls unpredictableIfs only once and resuses the result, since it recognizes that it's a pure function, so the result will be the same every time it's called. (Mine did, verified by looking at the generated assembly.)
With higher optimisation level, the functions are inlined, and the compiler doesn't recognize that it's the same code anymore, so it is run each time a function call appears in the source.
Apart from that, my gcc-4.7.1 deals best with unpredictableIfs when using -O1 or -O2 (apart from the reuse issue, both produce the same code), while noIfs is treated much better with -O3. The timings between the different runs of the same code are however consistent here - equal or differing by 10 milliseconds (granularity of clock), so I have no idea what could cause the substantially different times for unpredictableIfs you reported for -O3.
With -O2, the loop for unpredictableIfs is identical to the code generated with -O1 (except for register swapping):
.L12:
movl %eax, %ecx
andl $1073741826, %ecx
cmpl $1, %ecx
adcl $0, %edx
addl $1, %eax
cmpl $1000000000, %eax
jne .L12
and for noIfs it's similar:
.L15:
xorl %ecx, %ecx
testl $1073741826, %eax
sete %cl
addl $1, %eax
addl %ecx, %edx
cmpl $1000000000, %eax
jne .L15
where it was
.L7:
testl $1073741826, %edx
sete %cl
movzbl %cl, %ecx
addl %ecx, %eax
addl $1, %edx
cmpl $1000000000, %edx
jne .L7
with -O1. Both loops run in similar time, with unpredictableIfs a bit faster.
With -O3, the loop for unpredictableIfs becomes worse,
.L14:
leal 1(%rdx), %ecx
testl $1073741826, %eax
cmove %ecx, %edx
addl $1, %eax
cmpl $1000000000, %eax
jne .L14
and for noIfs (including the setup-code here), it becomes better:
pxor %xmm2, %xmm2
movq %rax, 32(%rsp)
movdqa .LC3(%rip), %xmm6
xorl %eax, %eax
movdqa .LC2(%rip), %xmm1
movdqa %xmm2, %xmm3
movdqa .LC4(%rip), %xmm5
movdqa .LC5(%rip), %xmm4
.p2align 4,,10
.p2align 3
.L18:
movdqa %xmm1, %xmm0
addl $1, %eax
paddd %xmm6, %xmm1
cmpl $250000000, %eax
pand %xmm5, %xmm0
pcmpeqd %xmm3, %xmm0
pand %xmm4, %xmm0
paddd %xmm0, %xmm2
jne .L18
.LC2:
.long 0
.long 1
.long 2
.long 3
.align 16
.LC3:
.long 4
.long 4
.long 4
.long 4
.align 16
.LC4:
.long 1073741826
.long 1073741826
.long 1073741826
.long 1073741826
.align 16
.LC5:
.long 1
.long 1
.long 1
.long 1
it computes four iterations at once, and accordingly, noIfs runs almost four times as fast then.
回答2:
Right, looking at the assembler code from gcc on 64-bit Linux, the first case, with -O1, the function UnpredictableIfs is indeed called only once, and the result reused.
With -O2 and -O3, the functions are inlined, and the time it takes should be identical. There is also no actual branches in either bit of code, but the translation for the two bits of code is somewhat different, I've cut out the lines that update "sum" [in %edx in both cases]
UnpredictableIfs:
movl %eax, %ecx
andl $1073741826, %ecx
cmpl $1, %ecx
adcl $0, %edx
addl $1, %eax
NoIfs:
xorl %ecx, %ecx
testl $1073741826, %eax
sete %cl
addl $1, %eax
addl %ecx, %edx
As you can see, it's not quite identical, but it does very similar things.
回答3:
Regarding the range of results on Windows (from 1016 ms to 4797 ms): You should know that clock() in MSVC returns elapsed wall time. The standard says clock() should return an approximation of CPU time spent by the process, and other implementations do a better job of this.
Given that MSVC is giving wall time, if your process got pre-empted while running one iteration of the test, it could give a much larger result, even if the code ran in approximately the same amount of CPU time.
Also note that clock() on many Windows PCs has a pretty lousy resolution, often like 11-19 ms. You've done enough iterations that that's only about 1%, so I don't think it's part of the discrepancy, but it's good to be aware of when trying to write a benchmark. I understand you're going for portability, but if you needed a better measurement on Windows, you can use QueryPerformanceCounter which will almost certainly give you much better resolution, though it's still just elapsed wall time.
UPDATE: After I learned that the long runtime on the one case was happening consistently, I fired up VS2010 and reproduced the results. I was typically getting something around 1000 ms for some runs, 750 ms for others, and 5000+ ms for the inexplicable ones.
Observations:
- In all cases the unpredictableIfs() code was inlined.
- Removing the noIfs() code had no impact (so the long time wasn't a side effect of that code).
- Setting thread affinity to a single processor had no effect.
- The 5000 ms times were invariably the later instances. I noted that the later instances had an extra instruction before the beginning of the loop:
lea ecx,[ecx]. I don't see why that should make a 5x difference. Other than that the early and later instances were identical code. - Removing the
volatilefrom thestartandstopvariables yielded fewer long runs, more of the 750 ms runs, and no 1000 ms runs. (The generated loop code looks exactly the same in all cases now, notleas.) - Removing the
volatilefrom thesumvariable (but keeping it for the clock timers), the long runs can happen in any position. - If you remove all of the
volatilequalifiers, you get consistent, fast (750 ms) runs. (The code looks identical to the earlier ones, butediwas chosen forsuminstead ofecx.)
I'm not sure what to conclude from all this, except that volatile has unpredictable performance consequences with MSVC, so you should apply it only when necessary.
UPDATE 2: I'm seeing consistent runtime differences tied to the use of volatile, even though the disassembly is almost identical.
With volatile:
Puzzling measurements:
Unpredictable ifs took 643 msec; answer was 500000000
Unpredictable ifs took 1248 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 4611 msec; answer was 500000000
Unpredictable ifs took 4706 msec; answer was 500000000
Unpredictable ifs took 4516 msec; answer was 500000000
Unpredictable ifs took 4382 msec; answer was 500000000
The disassembly for each instance looks like this:
start = clock();
010D1015 mov esi,dword ptr [__imp__clock (10D20A0h)]
010D101B add esp,4
010D101E call esi
010D1020 mov dword ptr [start],eax
sum = unpredictableIfs();
010D1023 xor ecx,ecx
010D1025 xor eax,eax
010D1027 test eax,40000002h
010D102C jne main+2Fh (10D102Fh)
010D102E inc ecx
010D102F inc eax
010D1030 cmp eax,3B9ACA00h
010D1035 jl main+27h (10D1027h)
010D1037 mov dword ptr [sum],ecx
stop = clock();
010D103A call esi
010D103C mov dword ptr [stop],eax
Without volatile:
Puzzling measurements:
Unpredictable ifs took 644 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
start = clock();
00321014 mov esi,dword ptr [__imp__clock (3220A0h)]
0032101A add esp,4
0032101D call esi
0032101F mov dword ptr [start],eax
sum = unpredictableIfs();
00321022 xor ebx,ebx
00321024 xor eax,eax
00321026 test eax,40000002h
0032102B jne main+2Eh (32102Eh)
0032102D inc ebx
0032102E inc eax
0032102F cmp eax,3B9ACA00h
00321034 jl main+26h (321026h)
stop = clock();
00321036 call esi
// The only optimization I see is here, where eax isn't explicitly stored
// in stop but is instead immediately used to compute the value for the
// printf that follows.
Other than register selection, I don't see a significant difference.
来源:https://stackoverflow.com/questions/14965553/difficulties-to-measure-c-c-performance