Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?


gcc4.8 makes a prologue that tries to reach an alignment boundary, but it assumes that uint16_t *p is 2-byte aligned, i.e. that some number of scalar iterations will make the pointer 16-byte aligned.

I don't think gcc ever intended to support misaligned pointers on x86, it just happened to work for non-atomic types without auto-vectorization. It's definitely undefined behaviour in ISO C to use a pointer to uint16_t with less than alignof(uint16_t)=2 alignment. GCC doesn't warn when it can see you breaking the rule at compile time, and actually happens to make working code (for malloc where it knows the return-value minimum alignment), but that's presumably just an accident of the gcc internals, and shouldn't be taken as an indication of "support".


Try with -O3 -fno-tree-vectorize or -O2. If my explanation is correct, that won't segfault, because it will only use scalar loads (which as you say on x86 don't have any alignment requirements).


gcc knows malloc returns 16-byte aligned memory on this target (x86-64 Linux, where max_align_t is 16 bytes wide because long double has padding out to 16 bytes in the x86-64 System V ABI). It sees what you're doing and uses movdqu.

But gcc doesn't treat mmap as a builtin, so it doesn't know that it returns page-aligned memory, and applies its usual auto-vectorization strategy which apparently assumes that uint16_t *p is 2-byte aligned, so it can use movdqa after handling misalignment. Your pointer is misaligned and violates this assumption.

(I wonder if newer glibc headers use __attribute__((assume_aligned(4096))) to mark mmap's return value as aligned. That would be a good idea, and would probably have given you about the same code-gen as for malloc. Except it wouldn't work because it would break error-checking for mmap != (void*)-1, as @Alcaro points out with an example on Godbolt: https://gcc.godbolt.org/z/gVrLWT)
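
To illustrate the problem with that idea, here's a hypothetical wrapper declaration (not anything glibc actually ships) showing both the attribute and why it would conflict with the MAP_FAILED check:

    #include <sys/mman.h>

    /* Hypothetical, not real glibc: promise gcc the return value is page-aligned. */
    __attribute__((assume_aligned(4096)))
    void *my_mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

    /* The catch: MAP_FAILED is (void *)-1, which is not 4096-aligned, so a compiler
     * that trusts the attribute is allowed to treat this error check as dead code:
     *
     *     void *p = my_mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
     *     if (p == MAP_FAILED) { ... }   // "can't happen" under assume_aligned(4096)
     */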


> on a CPU that is able to access unaligned

SSE2 movdqa segfaults on unaligned, and your elements are themselves misaligned so you have the unusual situation where no array element starts at a 16-byte boundary.

SSE2 is baseline for x86-64, so gcc uses it.


Ubuntu 14.04LTS uses gcc4.8.2. (Off topic: it's old and obsolete, with worse code-gen in many cases than gcc5.4 or gcc6.4, especially when auto-vectorizing. It doesn't even recognize -march=haswell.)

14 is the minimum threshold for gcc's heuristics to decide to auto-vectorize your loop in this function, with -O3 and no -march or -mtune options.
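
For reference, the loop being compiled is roughly of this shape (my reconstruction from the asm below, not the OP's exact source):

    #include <stdint.h>

    // Reconstructed shape of the OP's loop (not the exact source): 14 uint16_t
    // elements starting 1 byte into a buffer that came from mmap (or malloc).
    uint32_t sum14(char *buffer) {
        uint16_t *p = (uint16_t *)(buffer + 1);   // misaligned pointer: UB in ISO C
        uint32_t sum = 0;
        for (int i = 0; i < 14; i++)
            sum += p[i];
        return sum;
    }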

I put your code on Godbolt, and this is the relevant part of main:

    call    mmap    #
    lea     rdi, [rax+1]      # p,
    mov     rdx, rax  # buffer,
    mov     rax, rdi  # D.2507, p
    and     eax, 15   # D.2507,
    shr     rax        ##### rax>>=1 discards the low bit, assuming it's zero (i.e. assuming p is 2-byte aligned)
    neg     rax       # D.2507
    mov     esi, eax  # prolog_loop_niters.7, D.2507
    and     esi, 7    # prolog_loop_niters.7,
    je      .L2
    # .L2 leads directly to a MOVDQA xmm2, [rdx+1]

It figures out (with this block of code) how many scalar iterations to do before reaching MOVDQA, but none of the code paths lead to a MOVDQU loop. i.e. gcc doesn't have a code path to handle the case where p is odd.
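
In C terms, the prologue-count calculation is roughly this (my reading of the asm above, not gcc's source):

    #include <stdint.h>

    // My reading of the prologue above, not gcc's source: how many scalar
    // iterations to run before the movdqa loop.
    static unsigned prolog_iters(uintptr_t p) {
        uintptr_t misalign_bytes = p & 15;                // and eax, 15
        uintptr_t misalign_elems = misalign_bytes >> 1;   // shr rax: only valid if (p & 1) == 0
        return (unsigned)((0 - misalign_elems) & 7);      // neg rax; and esi, 7
    }
    // For even p this is the number of uint16_t elements needed to reach a 16-byte
    // boundary. For odd p (e.g. p & 15 == 1) it returns 0, so the vector loop runs
    // movdqa on a still-misaligned address and faults.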


But the code-gen for malloc looks like this:

    call    malloc  #
    movzx   edx, WORD PTR [rax+17]        # D.2497, MEM[(uint16_t *)buffer_5 + 17B]
    movzx   ecx, WORD PTR [rax+27]        # D.2497, MEM[(uint16_t *)buffer_5 + 27B]
    movdqu  xmm2, XMMWORD PTR [rax+1]   # tmp91, MEM[(uint16_t *)buffer_5 + 1B]

Note the use of movdqu. There are some more scalar movzx loads mixed in: 8 of the 14 total iterations are done with SIMD, and the remaining 6 with scalar loads. This is a missed optimization: it could easily do another 4 with a movq load, especially because that would fill an XMM vector after unpacking with zero to get uint32_t elements before adding.

(There are various other missed-optimizations, like maybe using pmaddwd with a multiplier of 1 to add horizontal pairs of words into dword elements.)
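
For what it's worth, here's a sketch of that pmaddwd idea with intrinsics (not what gcc emits; note that pmaddwd treats the words as signed, so this only matches an unsigned sum if the elements are below 0x8000):

    #include <emmintrin.h>   // SSE2
    #include <stdint.h>

    // Sketch of the pmaddwd idea, not gcc's actual code-gen: sum 8 uint16_t from a
    // possibly-unaligned pointer into 4 uint32_t partial sums in one instruction.
    static __m128i sum_pairs_u16(const uint16_t *p) {
        __m128i v = _mm_loadu_si128((const __m128i *)p);   // movdqu: no alignment requirement
        return _mm_madd_epi16(v, _mm_set1_epi16(1));       // pmaddwd by 1: adds horizontal pairs of words
    }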


Safe code with unaligned pointers:

If you do want to write code which uses unaligned pointers, you can do it correctly in ISO C using memcpy. On targets with efficient unaligned load support (like x86), modern compilers will still just use a simple scalar load into a register, exactly like dereferencing the pointer. But when auto-vectorizing, gcc won't assume that the pointer lines up with element boundaries, so it will use unaligned loads.

memcpy is how you express an unaligned load / store in ISO C / C++.

    #include <string.h>

    int sum(int *p) {
        int sum=0;
        for (int i=0 ; i<10001 ; i++) {
            // sum += p[i];
            int tmp;
    #ifdef USE_ALIGNED
            tmp = p[i];     // normal dereference
    #else
            memcpy(&tmp, &p[i], sizeof(tmp));  // unaligned load
    #endif
            sum += tmp;
        }
        return sum;
    }

With gcc7.2 -O3 -DUSE_ALIGNED, we get the usual scalar iterations until an alignment boundary is reached, then a vector loop: (Godbolt compiler explorer)

.L4:    # gcc7.2 normal dereference
    add     eax, 1
    paddd   xmm0, XMMWORD PTR [rdx]
    add     rdx, 16
    cmp     ecx, eax
    ja      .L4

But with memcpy, we get auto-vectorization with an unaligned load (and no intro/outro to handle alignment), unlike gcc's normal preference:

.L2:   # gcc7.2 memcpy for an unaligned pointer
    movdqu  xmm2, XMMWORD PTR [rdi]
    add     rdi, 16
    cmp     rax, rdi      # end_pointer != pointer
    paddd   xmm0, xmm2
    jne     .L2           # -mtune=generic still doesn't optimize for macro-fusion of cmp/jcc :(

    # hsum into EAX, then the final odd scalar element:
    add     eax, DWORD PTR [rdi+40000]   # this is how memcpy compiles for normal scalar code, too.

In the OP's case, simply arranging for the pointers to be aligned is a better choice. It avoids cache-line splits for scalar code (and for vectorized code the way gcc generates it), it doesn't cost much extra memory or space, and the data layout in memory isn't fixed by an external format, so you're free to choose the alignment.
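
A minimal sketch of that, with a hypothetical helper (assuming the byte offset isn't dictated by some external format):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdalign.h>

    // Hypothetical helper, not from the OP's code: round a byte offset up so the
    // uint16_t array starts on an aligned boundary inside the page-aligned buffer,
    // instead of dereferencing buffer + 1 directly.
    static uint16_t *aligned_u16_at(char *buffer, size_t byte_offset) {
        size_t a = alignof(uint16_t);                        // 2 on x86-64
        size_t rounded = (byte_offset + a - 1) & ~(a - 1);   // round up to a multiple of a
        return (uint16_t *)(buffer + rounded);
    }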

But sometimes that's not an option. With modern gcc / clang, memcpy fairly reliably optimizes away completely when you copy all the bytes of a primitive type: it becomes just a load or store, with no function call and no bouncing through an extra memory location. Even at -O0, this simple memcpy inlines with no function call, but of course tmp doesn't optimize away.

Anyway, check the compiler-generated asm if you're worried that it might not optimize away in a more complicated case, or with different compilers. For example, ICC18 doesn't auto-vectorize the version using memcpy.

For example, uint64_t tmp=0; and then memcpy over the low 3 bytes compiles to an actual copy to memory and reload, so that's not a good way to express zero-extension of odd-sized types.
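
A sketch of that pattern, for concreteness:

    #include <stdint.h>
    #include <string.h>

    // Sketch of the pattern described above: this tends to compile to a store of
    // tmp to the stack, a 3-byte copy into it, and a reload, rather than a single
    // clean zero-extending load.
    uint64_t load_u24(const unsigned char *p) {
        uint64_t tmp = 0;
        memcpy(&tmp, p, 3);   // copy only the low 3 bytes; the upper bytes stay zero
        return tmp;           // value interpretation assumes a little-endian target
    }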
