Question
Say I have a large nested loop of the form
long long i, j, k, i_end, j_end;
...
for (i = 0; i < i_end; i++) {
    j_bgn = get_j_bgn(i);
    for (j = j_bgn; j < j_end; j++) {
        ...
    }
}
with some large i_end and j_end, say i_end = j_end = 10000000000. If I know that j_bgn is always small, perhaps even always either 0 or 1, is it beneficial performance-wise to use a smaller type for it, like signed char j_bgn? Or does this come with a recurring cost due to the implicit conversion to long long each time we begin a new j loop?
I guess this has a pretty minor effect, but I would like to know the "proper"/pedantic way of doing this: either 1) keep all loop variables of the same type (and use the smallest type that can hold the largest integer needed), or 2) choose the type of each loop variable independently to be as small as possible.
Edit
From the comments/answers I see I need to supply further information:
- I sometimes want and sometimes do not want to use these variables (e.g. j) for indexing. Why is this relevant (as long as I make sure to use types large enough to cover my available memory)?
- In my actual code I use something like size_t (or ssize_t) for e.g. j, j_end. On modern hardware this is 64 bit.
I take it that using types smaller than 32 bit is not worthwhile, but is it still perhaps beneficial to use a 32 bit type for j_bgn rather than also using a 64 bit type (as I really do need for j and j_end)?
Answer 1:
This sounds like an actual use case for the "fast" data types defined in <cstdint> for C++ or <stdint.h> for C.
You can use int_fast8_t, int_fast16_t, int_fast32_t, or int_fast64_t, or their unsigned counterparts, to get the fastest integer type that is at least 8, 16, 32, or 64 bits wide.
I guess if you want to be really pedantic, you should use these and let the compiler pick the fastest option.
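To see what these types actually resolve to on a given platform, something like the following sketch may help. The macro BITS_OF and the function print_fast_widths are hypothetical names introduced here for illustration; the only guarantees come from the standard's minimum widths, which the static_assert lines restate. The printed widths depend on the implementation, so no output is shown.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* int_fast8_t, int_fast16_t, ... */
#include <stdio.h>

/* Hypothetical helper macro: storage size of a type in bits. */
#define BITS_OF(T) (sizeof(T) * CHAR_BIT)

/* The standard only promises a minimum width for each "fast" type. */
static_assert(BITS_OF(int_fast8_t)  >= 8,  "int_fast8_t has at least 8 bits");
static_assert(BITS_OF(int_fast16_t) >= 16, "int_fast16_t has at least 16 bits");
static_assert(BITS_OF(int_fast32_t) >= 32, "int_fast32_t has at least 32 bits");
static_assert(BITS_OF(int_fast64_t) >= 64, "int_fast64_t has at least 64 bits");

/* Hypothetical helper: print the sizes this implementation chose. */
void print_fast_widths(void)
{
    printf("int_fast8_t : %zu bits\n", BITS_OF(int_fast8_t));
    printf("int_fast16_t: %zu bits\n", BITS_OF(int_fast16_t));
    printf("int_fast32_t: %zu bits\n", BITS_OF(int_fast32_t));
    printf("int_fast64_t: %zu bits\n", BITS_OF(int_fast64_t));
}
```

Calling print_fast_widths() from main on the target platform shows whether, for example, int_fast8_t was widened beyond 8 bits there.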
Answer 2:
Many platforms require some additional operations if the integers are wider or narrower than the registers. (Most 64-bit platforms can handle 32-bit integers as efficiently as 64-bit, though.)
Example (with empty asm statements to stop the loops from being optimized away):
void lfoo(long long int loops)
{
    for(long long int i = 0; i < loops; i++) asm("");
}
void foo(int loops)
{
    for(int i = 0; i < loops; i++) asm("");
}
void bar(short int loops)
{
    for(short int i = 0; i < loops; i++) asm("");
}
void zoo(char loops)
{
    for(char i = 0; i < loops; i++) asm("");
}
The resulting code for old 32-bit ARM Cortex processors, without the ARMv6 sign-extension instructions that make short slightly less bad (Godbolt compiler explorer, gcc8.2 default options, -O3 without -march= or -mcpu=cortex-...):
lfoo:
cmp r0, #1
sbcs r3, r1, #0
bxlt lr
mov r2, #0
mov r3, #0
.L3:
adds r2, r2, #1
adc r3, r3, #0 @@ long long takes 2 registers, obviously bad
cmp r1, r3
cmpeq r0, r2 @@ and also to compare
bne .L3
bx lr
foo:
cmp r0, #0
bxle lr @ return if loops <= 0 (predicated instruction)
mov r3, #0 @ i = 0
.L8: @ do {
add r3, r3, #1 @ i++ (32-bit)
cmp r0, r3
bne .L8 @ } while(loops != i);
bx lr @ return
bar:
cmp r0, #0
bxle lr
mov r2, #0
.L12: @ do {
add r2, r2, #1 @ i++ (32-bit)
lsl r3, r2, #16 @ i <<= 16
asr r3, r3, #16 @ i >>= 16 (sign extend i from 16 to 32)
cmp r0, r3
bgt .L12 @ }while(loops > i)
bx lr
@@ gcc -mcpu=cortex-a15 for example uses
@@ sxth r2, r3
zoo:
cmp r0, #0
bxeq lr
mov r3, #0
.L16:
add r3, r3, #1
and r2, r3, #255 @ truncation to unsigned char is cheap
cmp r0, r2 @ but not free
bhi .L16
bx lr
As you can see, 32-bit integers are the most efficient, as they have the same size as the processor's registers (function foo).
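Note that the functions above narrow the loop counter itself, so the extra instructions run on every iteration. In the question only j_bgn would be narrow, and by the C conversion rules it is converted to the type of j once, when the inner loop is initialized. A minimal sketch of that shape, assuming a stand-in get_j_bgn that returns 0 or 1 (the question does not show the real one) and a hypothetical count_iterations wrapper:

```c
#include <stdint.h>

/* Hypothetical stand-in for the question's get_j_bgn(): always 0 or 1. */
static int_fast8_t get_j_bgn(int64_t i)
{
    return (int_fast8_t)(i & 1);
}

/* Counts inner-loop iterations; j and j_end stay 64-bit, j_bgn is narrow. */
int64_t count_iterations(int64_t i_end, int64_t j_end)
{
    int64_t count = 0;
    for (int64_t i = 0; i < i_end; i++) {
        int_fast8_t j_bgn = get_j_bgn(i);
        /* j_bgn is converted to int64_t here, once per outer iteration. */
        for (int64_t j = j_bgn; j < j_end; j++) {
            count++;  /* the hot path only touches 64-bit values */
        }
    }
    return count;
}
```

Whether the narrow j_bgn saves or costs anything measurable would have to be checked in the generated code for the actual target; the sketch only shows where the conversion sits.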
Source: https://stackoverflow.com/questions/64773308/small-vs-identical-types-of-loop-variables-in-c-c-for-performance