Question
I have this piece of code which segfaults when run on Ubuntu 14.04 on an AMD64 compatible CPU:
#include <inttypes.h>
#include <stdlib.h>
#include <sys/mman.h>

int main()
{
    uint32_t sum = 0;
    uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ,
                           MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    uint16_t *p = (buffer + 1);
    int i;

    for (i=0; i<14; ++i) {
        //printf("%d\n", i);
        sum += p[i];
    }

    return sum;
}
This only segfaults if the memory is allocated using mmap. If I use malloc, a buffer on the stack, or a global variable it does not segfault.
If I decrease the number of iterations of the loop to anything less than 14 it no longer segfaults. And if I print the array index from within the loop it also no longer segfaults.
Why does unaligned memory access segfault on a CPU that is able to access unaligned addresses, and why only under such specific circumstances?
Answer 1:
gcc4.8 makes a prologue that tries to reach an alignment boundary, but it assumes that uint16_t *p is 2-byte aligned, i.e. that some number of scalar iterations will make the pointer 16-byte aligned.

I don't think gcc ever intended to support misaligned pointers on x86; it just happened to work for non-atomic types without auto-vectorization. It's definitely undefined behaviour in ISO C to use a pointer to uint16_t with less than alignof(uint16_t)=2 alignment. GCC doesn't warn when it can see you breaking the rule at compile time, and actually happens to make working code (for malloc, where it knows the return-value minimum alignment), but that's presumably just an accident of the gcc internals, and shouldn't be taken as an indication of "support".
Try with -O3 -fno-tree-vectorize or -O2. If my explanation is correct, that won't segfault, because it will only use scalar loads (which, as you say, on x86 don't have any alignment requirements).
gcc knows malloc returns 16-byte aligned memory on this target (x86-64 Linux, where max_align_t is 16 bytes wide because long double has padding out to 16 bytes in the x86-64 System V ABI). It sees what you're doing and uses movdqu.
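For reference, you can check both of those numbers with a quick sketch like this (on x86-64 Linux with glibc, both print 16):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

int main(void)
{
    // max_align_t's alignment is 16 on x86-64 System V (because of long double),
    // and glibc's malloc returns blocks aligned at least that much.
    printf("alignof(max_align_t) = %zu\n", _Alignof(max_align_t));

    void *p = malloc(64);
    printf("malloc result mod 16  = %ju\n", (uintmax_t)((uintptr_t)p % 16));
    free(p);
    return 0;
}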
But gcc doesn't treat mmap as a builtin, so it doesn't know that it returns page-aligned memory, and applies its usual auto-vectorization strategy, which apparently assumes that uint16_t *p is 2-byte aligned, so it can use movdqa after handling misalignment. Your pointer is misaligned and violates this assumption.
(I wonder if newer glibc headers use __attribute__((assume_aligned(4096))) to mark mmap's return value as aligned. That would be a good idea, and would probably have given you about the same code-gen as for malloc. Except it wouldn't work, because it would break error-checking for mmap != (void*)-1, as @Alcaro points out with an example on Godbolt: https://gcc.godbolt.org/z/gVrLWT)
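One workaround that avoids that problem (a sketch of one possible approach, not something the glibc headers actually do) is to do the MAP_FAILED check first and only then promise the alignment to the compiler with GCC's __builtin_assume_aligned:

#include <stddef.h>
#include <sys/mman.h>

// Hypothetical helper: check for MAP_FAILED *before* asserting alignment,
// so the (void*)-1 error value never gets a bogus alignment promise.
static inline void *mmap_read_anon_aligned(size_t len)
{
    void *m = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return NULL;
    return __builtin_assume_aligned(m, 4096);   // mmap returns page-aligned memory
}

That keeps the error check valid while still letting the optimizer know that a successful return value is page-aligned, which should give code-gen closer to the malloc case.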
"on a CPU that is able to access unaligned"

SSE2 movdqa segfaults on unaligned, and your elements are themselves misaligned, so you have the unusual situation where no array element starts at a 16-byte boundary.

SSE2 is baseline for x86-64, so gcc uses it.
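The same distinction is visible with the SSE2 intrinsics that map to these instructions (a minimal sketch; buf and the function names are just for illustration):

#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>

// 16-byte aligned storage, so buf+1 is deliberately misaligned.
static uint8_t buf[32] __attribute__((aligned(16)));

__m128i load_unaligned(void)        // compiles to movdqu: any address is fine
{
    return _mm_loadu_si128((const __m128i *)(buf + 1));
}

__m128i load_aligned_faults(void)   // compiles to movdqa: faults because buf+1 isn't 16-byte aligned
{
    return _mm_load_si128((const __m128i *)(buf + 1));
}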
Ubuntu 14.04 LTS uses gcc4.8.2. (Off topic: it's old and obsolete, with worse code-gen in many cases than gcc5.4 or gcc6.4, especially when auto-vectorizing. It doesn't even recognize -march=haswell.)
14 is the minimum threshold for gcc's heuristics to decide to auto-vectorize your loop in this function, with -O3 and no -march or -mtune options.
I put your code on Godbolt, and this is the relevant part of main:
    call    mmap                #
    lea     rdi, [rax+1]        # p,
    mov     rdx, rax            # buffer,
    mov     rax, rdi            # D.2507, p
    and     eax, 15             # D.2507,
    shr     rax                 ##### rax>>=1 discards the low bit, assuming it's zero
    neg     rax                 # D.2507
    mov     esi, eax            # prolog_loop_niters.7, D.2507
    and     esi, 7              # prolog_loop_niters.7,
    je      .L2
    # .L2 leads directly to a MOVDQA xmm2, [rdx+1]
It figures out (with this block of code) how many scalar iterations to do before reaching MOVDQA, but none of the code paths lead to a MOVDQU loop, i.e. gcc doesn't have a code path to handle the case where p is odd.
But the code-gen for malloc looks like this:
    call    malloc                         #
    movzx   edx, WORD PTR [rax+17]         # D.2497, MEM[(uint16_t *)buffer_5 + 17B]
    movzx   ecx, WORD PTR [rax+27]         # D.2497, MEM[(uint16_t *)buffer_5 + 27B]
    movdqu  xmm2, XMMWORD PTR [rax+1]      # tmp91, MEM[(uint16_t *)buffer_5 + 1B]
Note the use of movdqu. There are some more scalar movzx loads mixed in: 8 of the 14 total iterations are done with SIMD, and the remaining 6 with scalar. This is a missed optimization: it could easily do another 4 with a movq load, especially because that fills an XMM vector after unpacking with zero to get uint32_t elements before adding.

(There are various other missed optimizations, like maybe using pmaddwd with a multiplier of 1 to add horizontal pairs of words into dword elements.)
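Spelled out with intrinsics, the pmaddwd idea looks roughly like this (a hand-written sketch, not gcc's actual output; note that pmaddwd treats the words as signed, so this only matches the unsigned sum when every element is below 0x8000):

#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>

// Sum 8 uint16_t elements from a possibly-misaligned pointer.
// pmaddwd with a multiplier of 1 widens and adds horizontal pairs of
// words into dword elements in a single instruction.
static uint32_t sum8_u16(const uint16_t *p)
{
    __m128i v    = _mm_loadu_si128((const __m128i *)p);   // movdqu: no alignment requirement
    __m128i ones = _mm_set1_epi16(1);
    __m128i d    = _mm_madd_epi16(v, ones);               // pmaddwd: 4 dword partial sums

    // Horizontal sum of the four dwords.
    d = _mm_add_epi32(d, _mm_shuffle_epi32(d, _MM_SHUFFLE(1, 0, 3, 2)));
    d = _mm_add_epi32(d, _mm_shuffle_epi32(d, _MM_SHUFFLE(2, 3, 0, 1)));
    return (uint32_t)_mm_cvtsi128_si32(d);
}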
Safe code with unaligned pointers:
If you do want to write code which uses unaligned pointers, you can do it correctly in ISO C using memcpy. On targets with efficient unaligned load support (like x86), modern compilers will still just use a simple scalar load into a register, exactly like dereferencing the pointer. But when auto-vectorizing, gcc won't assume that the pointer lines up with element boundaries and will use unaligned loads.

memcpy is how you express an unaligned load / store in ISO C / C++.
#include <string.h>

int sum(int *p) {
    int sum = 0;
    for (int i = 0; i < 10001; i++) {
        // sum += p[i];
        int tmp;
#ifdef USE_ALIGNED
        tmp = p[i];                          // normal dereference
#else
        memcpy(&tmp, &p[i], sizeof(tmp));    // unaligned load
#endif
        sum += tmp;
    }
    return sum;
}
With gcc7.2 -O3 -DUSE_ALIGNED, we get the usual scalar until an alignment boundary, then a vector loop (Godbolt compiler explorer):
.L4:    # gcc7.2 normal dereference
    add     eax, 1
    paddd   xmm0, XMMWORD PTR [rdx]
    add     rdx, 16
    cmp     ecx, eax
    ja      .L4
But with memcpy, we get auto-vectorization with an unaligned load (with no intro/outro to handle alignment), unlike gcc's normal preference:
.L2:    # gcc7.2 memcpy for an unaligned pointer
    movdqu  xmm2, XMMWORD PTR [rdi]
    add     rdi, 16
    cmp     rax, rdi            # end_pointer != pointer
    paddd   xmm0, xmm2
    jne     .L2                 # -mtune=generic still doesn't optimize for macro-fusion of cmp/jcc :(

    # hsum into EAX, then the final odd scalar element:
    add     eax, DWORD PTR [rdi+40000]    # this is how memcpy compiles for normal scalar code, too.
In the OP's case, simply arranging for pointers to be aligned is a better choice. It avoids cache-line splits for scalar code (or for vectorized code the way gcc does it). It doesn't cost a lot of extra memory or space, and the data layout in memory isn't fixed, so nothing forces the misaligned offset.
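In the OP's code, for example, that just means not overlaying the uint16_t array at an odd byte offset into the mapping (a sketch, assuming the layout is under your control):

#include <stdint.h>
#include <sys/mman.h>

// Keep the uint16_t data at offset 0 (or any even offset) inside the
// mapping, so p satisfies alignof(uint16_t) and the vectorized loop
// never touches a misaligned element.
uint32_t sum14(void)
{
    uint8_t *buffer = mmap(NULL, 1 << 18, PROT_READ,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buffer == MAP_FAILED)
        return 0;

    const uint16_t *p = (const uint16_t *)buffer;   // mmap returns page-aligned memory
    uint32_t sum = 0;
    for (int i = 0; i < 14; ++i)
        sum += p[i];
    return sum;
}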
But sometimes that's not an option. memcpy fairly reliably optimizes away completely with modern gcc / clang when you copy all the bytes of a primitive type, i.e. to just a load or store, with no function call and no bouncing to an extra memory location. Even at -O0, this simple memcpy inlines with no function call, but of course tmp doesn't optimize away.
Anyway, check the compiler-generated asm if you're worried that it might not optimize away in a more complicated case, or with different compilers. For example, ICC18 doesn't auto-vectorize the version using memcpy.
A uint64_t tmp=0; followed by a memcpy over the low 3 bytes compiles to an actual copy to memory and reload, so that's not a good way to express zero-extension of odd-sized types, for example.
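Concretely, the pattern being described is something like this (a sketch; load_u24 is just an illustrative name, and it assumes a little-endian host):

#include <stdint.h>
#include <string.h>

// Zero-extend a 3-byte little-endian field to 64 bits.
// As noted above, gcc doesn't fold this into a single load: tmp gets
// stored to the stack, the 3 bytes are copied into it, and it's reloaded.
static uint64_t load_u24(const uint8_t *p)
{
    uint64_t tmp = 0;
    memcpy(&tmp, p, 3);   // only the low 3 bytes are written; the rest stays zero
    return tmp;
}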
Source: https://stackoverflow.com/questions/47510783/why-does-unaligned-access-to-mmaped-memory-sometimes-segfault-on-amd64