Question
TL;DR: Why is multiplying/casting data in size_t slow, and why does this vary per platform?
I'm having some performance issues that I don't fully understand. The context is a camera frame grabber where a 128x128 uint16_t image is read and post-processed at a rate of several hundred Hz.
In the post-processing I generate a histogram frame->histo, which is of uint32_t and has thismaxval = 2^16 elements; basically I tally all intensity values. Using this histogram I calculate the sum and squared sum:
double sum=0, sumsquared=0;
size_t thismaxval = 1 << 16;
for(size_t i = 0; i < thismaxval; i++) {
sum += (double)i * frame->histo[i];
sumsquared += (double)(i * i) * frame->histo[i];
}
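For completeness, the histogram itself is just a tally of pixel intensities. A minimal sketch (the Frame layout here is made up for illustration, not my actual struct):

#include <stdint.h>
#include <string.h>

// Hypothetical frame layout, for illustration only.
struct Frame {
    uint16_t data[128 * 128];   // raw camera pixels
    uint32_t histo[1 << 16];    // one bin per possible 16-bit intensity
};

void build_histogram(Frame *frame) {
    memset(frame->histo, 0, sizeof(frame->histo));
    for (size_t p = 0; p < 128 * 128; p++)
        frame->histo[frame->data[p]]++;   // tally this pixel's intensity
}
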
Profiling the code with a profiler I got the following (samples, percentage, code):
58228 32.1263 : sum += (double)i * frame->histo[i];
116760 64.4204 : sumsquared += (double)(i * i) * frame->histo[i];
That is, the first line takes up 32% of CPU time and the second line 64%.
I did some benchmarking and it seems to be the datatype/casting that's problematic. When I change the code to
uint_fast64_t isum=0, isumsquared=0;
for(uint_fast32_t i = 0; i < thismaxval; i++) {
isum += i * frame->histo[i];
isumsquared += (i * i) * frame->histo[i];
}
it runs ~10x faster. However, this performance hit also varies per platform. On the workstation, a Core i7 950 @ 3.07 GHz, the code is 10x faster. On my MacBook8,1, which has an Intel Core i7 Sandy Bridge 2.7 GHz (2620M), the code is only 2x faster.
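For reference, a self-contained harness along these lines reproduces the comparison, compiled with the same g++ -O3 line as below. The histogram contents are dummy values (the real data comes from the camera), and a repeat loop is added so clock() has something to measure:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

int main() {
    const size_t thismaxval = 1 << 16;
    static uint32_t histo[1 << 16];
    for (size_t i = 0; i < thismaxval; i++)
        histo[i] = 1;                        // dummy bin counts

    double sum = 0, sumsquared = 0;
    clock_t c0 = clock();
    for (int rep = 0; rep < 1000; rep++)     // repeat for measurable runtime
        for (size_t i = 0; i < thismaxval; i++) {
            sum += (double)i * histo[i];
            sumsquared += (double)(i * i) * histo[i];
        }
    clock_t c1 = clock();

    uint_fast64_t isum = 0, isumsquared = 0;
    for (int rep = 0; rep < 1000; rep++)
        for (uint_fast32_t i = 0; i < thismaxval; i++) {
            isum += i * histo[i];
            isumsquared += (i * i) * histo[i];   // same form as the question
        }
    clock_t c2 = clock();

    printf("double loop: %.3f s, integer loop: %.3f s\n",
           (double)(c1 - c0) / CLOCKS_PER_SEC,
           (double)(c2 - c1) / CLOCKS_PER_SEC);
    // Print the results so the compiler cannot optimize the loops away.
    printf("%.0f %.0f %llu %llu\n", sum, sumsquared,
           (unsigned long long)isum, (unsigned long long)isumsquared);
    return 0;
}
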
Now I am wondering:
- Why is the original code so slow and easily sped up?
- Why does this vary per platform so much?
Update:
I compiled the above code with
g++ -O3 -Wall cast_test.cc -o cast_test
Update2:
I ran the optimized code through a profiler (Instruments on Mac, like Shark) and found two things:
1) The looping itself takes a considerable amount of time in some cases. thismaxval is of type size_t.
- for(size_t i = 0; i < thismaxval; i++) takes 17% of my total runtime
- for(uint_fast32_t i = 0; i < thismaxval; i++) takes 3.5%
- for(int i = 0; i < thismaxval; i++) does not show up in the profiler; I assume it's less than 0.1%
2) The datatypes and casting matter as follows:
- sumsquared += (double)(i * i) * histo[i]; takes 15% (with size_t i)
- sumsquared += (double)(i * i) * histo[i]; takes 36% (with uint_fast32_t i)
- isumsquared += (i * i) * histo[i]; takes 13% (with uint_fast32_t i, uint_fast64_t isumsquared)
- isumsquared += (i * i) * histo[i]; takes 11% (with int i, uint_fast64_t isumsquared)

Surprisingly, int is faster than uint_fast32_t?
Update4:
I ran some more tests with different datatypes and different compilers, on one machine. The results are as follows.
For tests 0--2 the relevant code is
for(loop_t i = 0; i < thismaxval; i++)
sumsquared += (double)(i * i) * histo[i];
with sumsquared a double, and loop_t being size_t, uint_fast32_t and int for tests 0, 1 and 2.
For tests 3--5 the code is
for(loop_t i = 0; i < thismaxval; i++)
isumsquared += (i * i) * histo[i];
with isumsquared of type uint_fast64_t and loop_t again size_t, uint_fast32_t and int for tests 3, 4 and 5.
The compilers I used are gcc 4.2.1, gcc 4.4.7, gcc 4.6.3 and gcc 4.7.0. The timings are percentages of the total CPU time of the code, so they show relative performance, not absolute (although the runtime was quite constant at 21 s). The CPU time is for both lines together, because I'm not quite sure the profiler correctly separated the two lines of code.
gcc:      4.2.1   4.4.7   4.6.3   4.7.0
----------------------------------------
test 0:   21.85   25.15   22.05   21.85
test 1:   21.9    25.05   22      22
test 2:   26.35   25.1    21.95   19.2
test 3:    7.15    8.35   18.55   19.95
test 4:   11.1     8.45    7.35    7.1
test 5:    7.1     7.8     6.9     7.05
Based on this, it seems that casting is expensive, regardless of what integer type I use.
Also, it seems gcc 4.6 and 4.7 are not able to optimize loop 3 (size_t and uint_fast64_t) properly.
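One workaround sketch I can think of (not benchmarked above): keep the double accumulators but carry the loop counter in a double alongside the integer index, so i and i*i no longer need to be converted in every iteration. Note that histo[i] still gets converted per iteration, so whether this helps is untested:

double di = 0.0;                       // double mirror of the loop counter
for (size_t i = 0; i < thismaxval; i++, di += 1.0) {
    sum += di * histo[i];              // no cast of the counter in the body
    sumsquared += di * di * histo[i];  // exact: di stays well below 2^53
}
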
Answer 1:
For your original questions:
- The code is slow because it involves a conversion from integer to floating-point types. That's why it's easily sped up when you also use an integer datatype for the sum variables: no float conversion is required anymore.
- The difference is the result of several factors. For example, it depends on how efficiently a platform can perform an int->float conversion. Furthermore, this conversion could also disturb processor-internal optimizations in the program flow and prediction engine, caches, etc., and the internal parallelizing features of the processor can also have a huge influence on such calculations.
For the additional questions:
- "Surprisingly int is faster than uint_fast32_t"? What's the sizeof(size_t) and sizeof(int) on your platform? One guess I can make is, that both are probably 64bit and therefore a cast to 32bit not only can give you calculation errors but also includes a different-size-casting penalty.
In general, try to avoid visible and hidden casts as much as possible if they aren't really necessary. For example, try to find out what real datatype is hidden behind size_t in your environment (gcc) and use that one for the loop variable. In your example the squares of unsigned ints and the histogram entries are all integers, so it makes no sense to use double here. Stick to integer types to achieve maximum performance.
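Following that advice, a sketch of the all-integer version with a single conversion at the end. The overflow bound comes from the question's 128x128 frame: sum(i^2 * histo[i]) <= 65535^2 * 16384 < 2^63, so uint64_t cannot overflow here:

#include <stdint.h>

void stats(const uint32_t histo[1 << 16], double *sum, double *sumsquared) {
    uint64_t isum = 0, isumsq = 0;
    for (uint32_t i = 0; i < (1 << 16); i++) {
        isum   += (uint64_t)i * histo[i];
        isumsq += (uint64_t)i * i * histo[i];
    }
    *sum        = (double)isum;    // one int->float conversion in total,
    *sumsquared = (double)isumsq;  // instead of one (or more) per iteration
}
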
Answer 2:
On x86, the conversion of uint64_t to floating point is slower because there are only instructions to convert int64_t, int32_t and int16_t. int16_t and, in 32-bit mode, int64_t can only be converted using x87 instructions, not SSE.
When converting uint64_t to floating point, GCC 4.2.1 first converts the value as if it were an int64_t and then adds 2^64 if it was negative, to compensate. (When using the x87, as on Windows and *BSD, or if you changed the precision control, beware that the conversion ignores precision control but the addition respects it.)
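In C, that compensation looks roughly like the following sketch (illustrative only, not the actual generated code):

#include <stdint.h>

double u64_to_double(uint64_t u) {
    double d = (double)(int64_t)u;     // signed conversion; "negative" for u >= 2^63
    if ((int64_t)u < 0)
        d += 18446744073709551616.0;   // compensate by adding 2^64
    return d;
}
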
A uint32_t is first extended to int64_t.
When converting 64-bit integers in 32-bit mode on processors with certain 64-bit capabilities, a store-to-load forwarding issue may cause stalls. The 64-bit integer is written as two 32-bit values and read back as one 64-bit value. This can be very bad if the conversion is part of a long dependency chain (not in this case).
Source: https://stackoverflow.com/questions/10596350/cast-performance-from-size-t-to-double