How to make GCC generate bswap instruction for big endian store without builtins?

栀梦 2020-12-05 04:55

I'm working on a function that stores a 64-bit value into memory in big endian format. I was hoping that I could write portable C99 code that works on both little a

3 answers
  •  攒了一身酷
    2020-12-05 05:36

    All functions in this answer, with asm output, are on the Godbolt Compiler Explorer.


    GNU C has a uint64_t __builtin_bswap64 (uint64_t x), since GNU C 4.3. This is apparently the most reliable way to get gcc / clang to generate code that doesn't suck for this.
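
A minimal sketch of a big-endian store built on that builtin follows. The helper name `store_be64_builtin` is hypothetical, and the code assumes a compiler that predefines `__BYTE_ORDER__` (gcc 4.6 or later, or clang), which is a stricter requirement than the builtin itself.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the answer: store `value` at `dst` in
 * big-endian byte order. Assumes __BYTE_ORDER__ is predefined. */
static inline void store_be64_builtin(unsigned char *dst, uint64_t value)
{
#if !defined(__BYTE_ORDER__)
# error "this sketch assumes the compiler predefines __BYTE_ORDER__"
#elif __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    value = __builtin_bswap64(value);   /* little-endian host: reverse the bytes */
#elif __BYTE_ORDER__ != __ORDER_BIG_ENDIAN__
# error "unsupported byte order"
#endif
    /* memcpy avoids alignment and aliasing problems; with optimization on,
     * compilers typically turn a fixed 8-byte copy into a single store. */
    memcpy(dst, &value, sizeof value);
}
```

Taking an `unsigned char *` destination instead of a `uint64_t *` lets the caller write into an arbitrary byte buffer without worrying about alignment.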

    glibc provides htobe64, htole64, and similar host to/from BE and LE functions that swap or not, depending on the endianness of the machine. See the docs for <endian.h>. The man page says they were added to glibc in version 2.9 (released 2008-11).

    #define _BSD_SOURCE             /* See feature_test_macros(7) */
    
    #include <stdint.h>
    
    #include <endian.h>
    // ideal code with clang from 3.0 onwards, probably earlier
    // ideal code with gcc from 4.4.7 onwards, probably earlier
    uint64_t load_be64_endian_h(const uint64_t *be_src) { return be64toh(*be_src); }
        movq    (%rdi), %rax
        bswap   %rax
    
    void store_be64_endian_h(uint64_t *be_dst, uint64_t data) { *be_dst = htobe64(data); }
        bswap   %rsi
        movq    %rsi, (%rdi)
    
    // check that the compiler understands the data movement and optimizes away a double-conversion (which inline-asm `bswap` wouldn't)
    // it does optimize away with gcc 4.9.3 and later, but not with gcc 4.9.0 (2x bswap)
    // optimizes away with clang 3.7.0 and later, but not clang 3.6 or earlier (2x bswap)
    uint64_t double_convert(uint64_t data) {
      uint64_t tmp;
      store_be64_endian_h(&tmp, data);
      return load_be64_endian_h(&tmp);
    }
        movq    %rdi, %rax
    

    You safely get good code even at -O1 from those functions, and they use movbe when -march is set to a CPU that supports that insn.


    If you're targeting GNU C, but not glibc, you can borrow the definition from glibc (remember it's LGPLed code, though):

    #ifdef __GNUC__
    # if __GNUC_PREREQ (4, 3)
    
    static __inline unsigned int
    __bswap_32 (unsigned int __bsx) { return __builtin_bswap32 (__bsx);  }
    
    # elif __GNUC__ >= 2
        // ... some fallback stuff you only need if you're using an ancient gcc version, using inline asm for non-compile-time-constant args
    # endif  // gcc version
    #endif // __GNUC__
    

    If you really need a fallback that might compile well on compilers that don't support GNU C builtins, the code from @bolov's answer could be used to implement a bswap that compiles nicely. Pre-processor macros could be used to choose whether to swap or not (like glibc does), to implement host-to-BE and host-to-LE functions. The bswap used by glibc when __builtin_bswap or x86 asm isn't available uses the mask-and-shift idiom that bolov found was good. gcc recognizes it better than just shifting.
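
The fallback described above could be sketched roughly as follows. This is an illustration of the mask-and-shift idiom with a preprocessor switch, not glibc's or bolov's actual code; `bswap64_portable`, `host_to_be64`, `host_to_le64`, and the `HOST_IS_BIG_ENDIAN` macro are hypothetical names, and the macro is assumed to be supplied by the build system on big-endian targets.

```c
#include <stdint.h>

/* Mask-and-shift byte swap in plain C99, no builtins or inline asm. */
static inline uint64_t bswap64_portable(uint64_t x)
{
    return ((x & 0xff00000000000000ULL) >> 56) |
           ((x & 0x00ff000000000000ULL) >> 40) |
           ((x & 0x0000ff0000000000ULL) >> 24) |
           ((x & 0x000000ff00000000ULL) >>  8) |
           ((x & 0x00000000ff000000ULL) <<  8) |
           ((x & 0x0000000000ff0000ULL) << 24) |
           ((x & 0x000000000000ff00ULL) << 40) |
           ((x & 0x00000000000000ffULL) << 56);
}

/* HOST_IS_BIG_ENDIAN is a hypothetical macro that the build system would
 * define for big-endian targets; without it, little-endian is assumed. */
#if defined(HOST_IS_BIG_ENDIAN)
# define host_to_be64(x) (x)
# define host_to_le64(x) bswap64_portable(x)
#else
# define host_to_be64(x) bswap64_portable(x)
# define host_to_le64(x) (x)
#endif
```

Whether a given compiler version turns `bswap64_portable` into a single `bswap` is worth checking in the asm output rather than assuming.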


    The code from this Endian-agnostic coding blog post compiles to bswap with gcc, but not with clang. IDK if there's anything that both their pattern-recognizers will recognize.

    // Note that this is a load, not a store like the code in the question.
    uint64_t be64_to_host(unsigned char* data) {
        return
          ((uint64_t)data[7]<<0)  | ((uint64_t)data[6]<<8 ) |
          ((uint64_t)data[5]<<16) | ((uint64_t)data[4]<<24) |
          ((uint64_t)data[3]<<32) | ((uint64_t)data[2]<<40) |
          ((uint64_t)data[1]<<48) | ((uint64_t)data[0]<<56);
    }
    
        ## gcc 5.3 -O3 -march=haswell
        movbe   (%rdi), %rax
        ret
    
        ## clang 3.8 -O3 -march=haswell
        movzbl  7(%rdi), %eax
        movzbl  6(%rdi), %ecx
        shlq    $8, %rcx
        orq     %rax, %rcx
        ... completely naive implementation
    

    The htonll from this answer compiles to two 32bit bswaps combined with shift/or. This kind of sucks, but isn't terrible with either gcc or clang.
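
For context, a common shape for such an `htonll` is sketched below. This is an assumption about what the linked answer looks like, not a copy of it; the name `htonll_sketch` is hypothetical (chosen to avoid clashing with platforms that already provide `htonll`), and it relies on POSIX `htonl` from `<arpa/inet.h>`.

```c
#include <stdint.h>
#include <arpa/inet.h>   /* htonl (POSIX) */

/* Hypothetical sketch: convert a 64-bit value to network (big-endian)
 * order by swapping each 32-bit half and exchanging the halves. */
static inline uint64_t htonll_sketch(uint64_t x)
{
    if (htonl(1) == 1)                      /* big-endian host: nothing to do */
        return x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t lo = (uint32_t)(x & 0xffffffffULL);
    /* The swapped low half becomes the new high half, and vice versa. */
    return ((uint64_t)htonl(lo) << 32) | htonl(hi);
}
```

The two `htonl` calls plus the shift and OR are what show up as two 32-bit `bswap` instructions combined with shift/or in the asm.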


    I didn't have any luck with a union { uint64_t a; uint8_t b[8]; } version of the OP's code. clang still compiles it to a 64-bit bswap, but I think it compiles to even worse code with gcc. (See the Godbolt link.)
