Why would uint32_t be preferred rather than uint_fast32_t?

Asked by 没有蜡笔的小新 on 2021-01-31 00:58

It seems that uint32_t is much more prevalent than uint_fast32_t (I realise this is anecdotal evidence). That seems counter-intuitive to me, though.

11 Answers
  •  感动是毒
    2021-01-31 01:29

    From the viewpoint of correctness and ease of coding, uint32_t has many advantages over uint_fast32_t, in particular because of its more precisely defined size and arithmetic semantics, as many users above have pointed out.

    What has perhaps been missed is that the one supposed advantage of uint_fast32_t - that it can be faster - just never materialized in any meaningful way. Most of the 64-bit processors that have dominated the 64-bit era (mostly x86-64 and AArch64) evolved from 32-bit architectures and have fast 32-bit native operations even in 64-bit mode. So uint_fast32_t is just the same as uint32_t on those platforms.

    Even on the "also-ran" platforms like POWER, MIPS64 and SPARC that only offer 64-bit ALU operations, the vast majority of interesting 32-bit operations can be done just fine in 64-bit registers: the bottom 32 bits will hold the desired result (and all mainstream platforms at least let you load/store 32 bits). Left shift is the main problematic one, but even that can often be optimized away by value/range tracking in the compiler.
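    The left-shift caveat above can be shown in plain C: a 32-bit shift discards the bit shifted out of the top, while the same shift performed at 64-bit width keeps it, so a compiler doing 32-bit shifts in 64-bit registers must insert an extra truncation/masking step to preserve the 32-bit semantics.

    ```c
    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        uint32_t a = 0x80000000u;
        uint64_t b = 0x80000000u;

        /* The 32-bit shift wraps: the top bit falls off.  The 64-bit
           shift keeps it, which is why a 64-bit-only ALU needs a
           masking step to emulate the 32-bit result. */
        uint32_t a2 = a << 1;   /* 0x00000000 */
        uint64_t b2 = b << 1;   /* 0x100000000 */

        assert(a2 == 0);
        assert(b2 == 0x100000000ull);
        return 0;
    }
    ```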

    I doubt that the occasional slightly slower left shift or 32x32 -> 64 multiplication will outweigh doubling the memory use for such values in all but the most obscure applications.
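    The doubling is easy to see for bulk data. A sketch, assuming an ABI (such as x86-64 glibc) where uint_fast32_t is 8 bytes - on such a platform the second array occupies twice the memory of the first for identical value ranges:

    ```c
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* One million counters: fixed 4 MB with uint32_t, but 8 MB
           on ABIs where uint_fast32_t is 8 bytes wide. */
        enum { N = 1000000 };
        printf("uint32_t array:      %zu bytes\n", sizeof(uint32_t) * N);
        printf("uint_fast32_t array: %zu bytes\n", sizeof(uint_fast32_t) * N);
        return 0;
    }
    ```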

    Finally, I'll note that while the tradeoff has largely been characterized as memory use and vectorization potential (favoring uint32_t) versus instruction count/speed (favoring uint_fast32_t), even that isn't clear to me. Yes, on some platforms you'll need additional instructions for some 32-bit operations, but you'll also save some instructions because:

    • Using a smaller type often allows the compiler to cleverly combine adjacent operations by using one 64-bit operation to accomplish two 32-bit ones. This kind of "poor man's vectorization" is not uncommon. For example, materializing a constant struct two32 { uint32_t a, b; } such as two32{1, 2} in rax can be done with a single mov rax, 0x200000001, while the 64-bit version needs two instructions. In principle this should also be possible for adjacent arithmetic operations (same operation, different operands), but I haven't seen it in practice.
    • Lower memory use also often leads to fewer instructions, even when the memory or cache footprint itself isn't a problem: whenever a structure or array of this type is copied, you get twice the bang for your buck per register copied.
    • Smaller data types can take better advantage of modern calling conventions, like the SysV ABI, which pack structure data efficiently into registers. For example, up to a 16-byte structure can be returned in the register pair rdx:rax. For a function returning a structure with four uint32_t values (initialized from constants), that translates into

      ret_constant32():
          movabs  rax, 8589934593
          movabs  rdx, 17179869187
          ret
      

      The same structure with four 64-bit uint_fast32_t members needs a register move and four stores to memory to do the same thing (and the caller will probably have to read the values back from memory after the return):

      ret_constant64():
          mov     rax, rdi
          mov     QWORD PTR [rdi], 1
          mov     QWORD PTR [rdi+8], 2
          mov     QWORD PTR [rdi+16], 3
          mov     QWORD PTR [rdi+24], 4
          ret
      

      Similarly, when passing structure arguments, 32-bit values are packed about twice as densely into the registers available for parameters, making it less likely that you'll run out of register arguments and have to spill to the stack¹.

    • Even if you choose to use uint_fast32_t in places where "speed matters", you'll often also have places where you need a fixed-size type. For example, when passing values for external output, reading external input, as part of your ABI, as part of a structure that needs a specific layout, or because you smartly use uint32_t for large aggregations of values to save on memory footprint. In the places where your uint_fast32_t and uint32_t types need to interface, you might find (in addition to the development complexity) unnecessary sign extensions or other size-mismatch related code. Compilers do an OK job of optimizing this away in many cases, but it is still not unusual to see it in optimized output when mixing types of different sizes.
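    A sketch of that last point, using a hypothetical checksum_records function: the hot loop runs in the "fast" type, but the data and the result live in the fixed-size type, so on ABIs where the widths differ the compiler has to widen on every load and truncate at the boundary.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical interface mixing the two types: a fast accumulator
       over records stored in a fixed-size wire format.  Where the
       widths differ, each load widens and the return truncates --
       exactly the conversion code described above. */
    static uint32_t checksum_records(const uint32_t *records, size_t n) {
        uint_fast32_t acc = 0;          /* "fast" type for the hot loop */
        for (size_t i = 0; i < n; i++)
            acc += records[i];          /* widened on 64-bit-fast ABIs  */
        return (uint32_t)acc;           /* truncated back to fixed size */
    }

    int main(void) {
        uint32_t recs[] = { 0xFFFFFFFFu, 1, 2 };
        /* 0xFFFFFFFF + 1 + 2 wraps to 2 after the final 32-bit truncation. */
        assert(checksum_records(recs, 3) == 2);
        return 0;
    }
    ```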

    You can play with some of the examples above and more on godbolt.


    ¹ To be clear, the convention of packing structures tightly into registers isn't always a clear win for smaller values. It does mean that the smaller values may have to be "extracted" before they can be used. For example, a simple function that returns the sum of the two structure members needs mov rax, rdi; shr rax, 32; add edi, eax, while in the 64-bit version each argument arrives in its own register and the sum needs only a single add or lea. Still, if you accept that the "pack structures tightly when passing" design makes sense overall, smaller values take better advantage of it.
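    The footnote's example in C, with hypothetical pair types: under the SysV ABI the uint32_t pair travels packed in a single register (so the callee must shift the high half out), while the uint64_t pair gets one register per member and adds directly.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical pair types matching the footnote.  With 32-bit
       members the whole struct fits in one register and must be
       unpacked before use; with 64-bit members each field arrives
       in its own register and the add is direct. */
    struct pair32 { uint32_t a, b; };
    struct pair64 { uint64_t a, b; };

    static uint32_t sum32(struct pair32 p) { return p.a + p.b; }
    static uint64_t sum64(struct pair64 p) { return p.a + p.b; }

    int main(void) {
        assert(sum32((struct pair32){ 3, 4 }) == 7);
        assert(sum64((struct pair64){ 3, 4 }) == 7);
        return 0;
    }
    ```

    Compiling both at -O2 on an x86-64 compiler and comparing the generated code is a quick way to see the shift-and-extract sequence the footnote describes.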
