Small vs. identical types of loop variables in C/C++ for performance

Submitted by 与世无争的帅哥 on 2020-12-15 07:20:24

Question


Say I have a large nested loop of the form

long long i, j, k, i_end, j_end;
...
for (i = 0; i < i_end; i++) {
  j_bgn = get_j_bgn(i);
  for (j = j_bgn; j < j_end; j++) {
    ...
  }
}

with some large i_end and j_end, say i_end = j_end = 10000000000. If I know that j_bgn is always small, perhaps even always either 0 or 1, is it beneficial performance-wise to use a smaller type for this, like signed char j_bgn? Or does this come with a recurring cost due to implicit casting to long long each time we begin a new j loop?

I guess this has a pretty minor effect, but I would like to know the "proper"/pedantic way of doing this: either 1) keep all loop variables of the same type (and use the smallest type that can hold the largest integer needed), or 2) choose the type of each loop variable independently to be as small as possible.

Edit

From the comments/answers I see I need to supply further information:

  • I sometimes want and sometimes do not want to use these variables (e.g. j) for indexing. Why is this relevant (as long as I make sure to use types large enough to cover my available memory)?
  • In my actual code I use something like size_t (or ssize_t) for e.g. j, j_end. On modern hardware this is 64 bit.

I take it that using types smaller than 32 bit is not worthwhile, but is it still perhaps beneficial to use a 32 bit type for j_bgn rather than also using a 64 bit type (as I really do need for j and j_end)?


Answer 1:


This sounds like an actual use case for the "fast" datatypes defined in <cstdint> for C++ or <stdint.h> for C.

You can use int_fast8_t, int_fast16_t, int_fast32_t, or int_fast64_t, or their unsigned counterparts, to get the fastest integer type that is at least 8, 16, 32, or 64 bits wide.

I guess if you want to be really pedantic, you should pick these and let the compiler pick the fastest option.
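As a minimal sketch of what "at least N bits" means in practice, the helpers below report the storage size the implementation chose for each fast type. The function names are hypothetical and only serve this illustration. The actual widths are implementation-defined: glibc on x86-64, for example, makes int_fast16_t and int_fast32_t 64 bits wide, and other C libraries may choose differently.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: storage size in bits of each "fast" type,
   as chosen by the implementation that compiles this file. */
size_t fast8_bits(void)  { return sizeof(int_fast8_t)  * CHAR_BIT; }
size_t fast16_bits(void) { return sizeof(int_fast16_t) * CHAR_BIT; }
size_t fast32_bits(void) { return sizeof(int_fast32_t) * CHAR_BIT; }
size_t fast64_bits(void) { return sizeof(int_fast64_t) * CHAR_BIT; }
```

The standard only guarantees the lower bounds (8, 16, 32, 64), so printing these values on the target platform is the way to see what was actually picked.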




Answer 2:


Many platforms require some additional operations if the integers are wider or smaller than the width of the registers. (Most 64-bit platforms can handle 32-bit integers as efficiently as 64-bit, though.)

Example (with empty asm statements to stop the loops optimizing away):

void lfoo(long long int loops)
{
    for(long long int i = 0; i < loops; i++) asm("");
}

void foo(int loops)
{
    for(int i = 0; i < loops; i++) asm("");
}

void bar(short int loops)
{
    for(short int i = 0; i < loops; i++) asm("");
}

void zoo(char loops)
{
    for(char i = 0; i < loops; i++) asm("");
}

and the resulting code for old 32-bit ARM Cortex processors, without ARMv6 sign-extension instructions which make short slightly less bad (Godbolt compiler explorer, gcc8.2 default options, -O3 without -march= or -mcpu=cortex-...)

lfoo:
        cmp     r0, #1
        sbcs    r3, r1, #0
        bxlt    lr
        mov     r2, #0
        mov     r3, #0
.L3:
        adds    r2, r2, #1
        adc     r3, r3, #0        @@ long long takes 2 registers, obviously bad
        cmp     r1, r3
        cmpeq   r0, r2            @@ and also to compare
        bne     .L3
        bx      lr

foo:
        cmp     r0, #0
        bxle    lr                @ return if loops <= 0 (predicated instruction)
        mov     r3, #0            @ i = 0
.L8:                              @ do {
        add     r3, r3, #1          @ i++  (32-bit)
        cmp     r0, r3             
        bne     .L8               @ } while(loops != i);
        bx      lr                @ return

bar:
        cmp     r0, #0
        bxle    lr
        mov     r2, #0
.L12:                            @ do {
        add     r2, r2, #1          @ i++ (32-bit)
        lsl     r3, r2, #16         @ i <<= 16
        asr     r3, r3, #16         @ i >>= 16  (sign extend i from 16 to 32)
        cmp     r0, r3
        bgt     .L12             @ }while(loops > i)
        bx      lr
                @@ gcc -mcpu=cortex-a15 for example uses
                @@  sxth    r2, r3

zoo:
        cmp     r0, #0
        bxeq    lr
        mov     r3, #0
.L16:
        add     r3, r3, #1
        and     r2, r3, #255     @ truncation to unsigned char is cheap
        cmp     r0, r2           @ but not free
        bhi     .L16
        bx      lr

As you can see, the most efficient are 32-bit integers, as they have the same size as the processor registers (function foo).
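Relating this back to the loop in the question: j_bgn is read only once per outer iteration, when j is initialised, so any widening conversion to the type of j happens once per j loop and not once per inner iteration. The sketch below is an illustration under assumptions, not the asker's code: get_j_bgn here is a hypothetical stand-in that returns 0 or 1, and count_inner_iterations is a made-up name.

```c
#include <stdint.h>

/* Hypothetical stand-in for the question's get_j_bgn(): always 0 or 1. */
static int32_t get_j_bgn(int64_t i)
{
    return (int32_t)(i % 2);
}

/* Counts how many times the inner loop body runs in total. */
int64_t count_inner_iterations(int64_t i_end, int64_t j_end)
{
    int64_t total = 0;
    for (int64_t i = 0; i < i_end; i++) {
        int32_t j_bgn = get_j_bgn(i);   /* narrow type for the start value */
        /* j_bgn is converted to int64_t once here, at loop entry;
           the counter j itself stays 64-bit throughout. */
        for (int64_t j = j_bgn; j < j_end; j++) {
            total++;
        }
    }
    return total;
}
```

The loop counter and its bound, which are touched on every iteration, are the variables whose width matters in the listings above; the start value is not.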



Source: https://stackoverflow.com/questions/64773308/small-vs-identical-types-of-loop-variables-in-c-c-for-performance
