auto-vectorization

sum of overlapping arrays, auto-vectorization, and restrict

点点圈 submitted on 2019-11-26 23:35:11
Question: Ars Technica recently had an article, "Why are some programming languages faster than others". It compares Fortran and C and mentions summing arrays. In Fortran it is assumed that arrays don't overlap, which allows further optimization. In C/C++, pointers to the same type may overlap, so this optimization can't be used in general. However, in C/C++ one can use the restrict or __restrict keyword to tell the compiler that it may assume the pointers do not overlap. So I started looking into this in regards to auto
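To make the aliasing point in this excerpt concrete, here is a minimal C sketch. The function names `add_may_alias` and `add_no_alias` are hypothetical and do not come from the original question. In the first version the compiler has to allow for `out` overlapping `a` or `b`, so it may need a runtime overlap check or scalar code; in the second, `restrict` is the caller's promise that the three arrays do not overlap. Whether a given compiler actually vectorizes either loop depends on its version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* No aliasing promise: out may overlap a or b, so every store to out[i]
 * could, as far as the compiler knows, change a later a[j] or b[j]. */
void add_may_alias(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* restrict (C99): the caller promises that out, a and b do not overlap.
 * Breaking that promise is undefined behaviour. */
void add_no_alias(float *restrict out, const float *restrict a,
                  const float *restrict b, size_t n)
{
    /* Cheap sanity check only; it does not detect partial overlap. */
    assert(out != a && out != b);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

For non-overlapping inputs both functions produce the same result; the difference is only in what the optimizer is allowed to assume.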

Unroll loop and do independent sum with vectorization

强颜欢笑 submitted on 2019-11-26 21:50:48
Question: For the following loop, GCC will only vectorize the loop if I tell it to use associative math, e.g. with -Ofast.

    float sumf(float *x) {
        x = (float*)__builtin_assume_aligned(x, 64);
        float sum = 0;
        for(int i=0; i<2048; i++) sum += x[i];
        return sum;
    }

Here is the assembly with -Ofast -mavx:

    sumf(float*):
        vxorps %xmm0, %xmm0, %xmm0
        leaq 8192(%rdi), %rax
    .L2:
        vaddps (%rdi), %ymm0, %ymm0
        addq $32, %rdi
        cmpq %rdi, %rax
        jne .L2
        vhaddps %ymm0, %ymm0, %ymm0
        vhaddps %ymm0, %ymm0, %ymm1
        vperm2f128 $1,
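The reason associative math matters here is that floating-point addition is not associative, so a compiler that honours strict IEEE semantics may not regroup the additions into per-lane partial sums on its own. A rough sketch of the "independent sums" idea from the title, written by hand in C, might look like the following. The name `sumf_4acc` and the choice of four accumulators are assumptions for illustration, not the original poster's code, and the result can differ in rounding from the strictly sequential sum.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n floats using four independent accumulators. The regrouping that
 * -fassociative-math would permit is spelled out in the source instead. */
float sumf_4acc(const float *x, size_t n)
{
    assert(x != NULL || n == 0);

    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: the four running sums do not depend on each other. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    /* Combine the partial sums, then add any leftover elements. */
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        sum += x[i];
    return sum;
}
```

Because the four chains are independent, they map naturally onto the lanes of a SIMD register, but whether a particular GCC version vectorizes this form without -Ofast is something to check in the generated assembly.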

Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?

允我心安 submitted on 2019-11-26 16:42:31
I have this piece of code which segfaults when run on Ubuntu 14.04 on an AMD64-compatible CPU:

    #include <inttypes.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main() {
        uint32_t sum = 0;
        uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        uint16_t *p = (buffer + 1);
        int i;
        for (i=0;i<14;++i) {
            //printf("%d\n", i);
            sum += p[i];
        }
        return sum;
    }

This only segfaults if the memory is allocated using mmap. If I use malloc, a buffer on the stack, or a global variable, it does not segfault. If I decrease the number of iterations of the loop to anything less than 14 it
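In the code above, `p` is a `uint16_t *` pointing at an odd address, and in C, converting to or dereferencing a pointer that is not suitably aligned for its type is undefined behaviour, so the compiler is free to assume 2-byte alignment. A common way to read misaligned values without relying on that assumption is to copy the bytes with `memcpy`. The sketch below uses hypothetical helper names (`load_u16_unaligned`, `sum_u16_unaligned`) and is not offered as a diagnosis of the crash in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a uint16_t from an address with no alignment guarantee.
 * memcpy has no alignment requirement on its arguments. */
static uint16_t load_u16_unaligned(const uint8_t *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Sum `count` consecutive 16-bit values starting at an arbitrary byte. */
uint32_t sum_u16_unaligned(const uint8_t *bytes, size_t count)
{
    assert(bytes != NULL || count == 0);

    uint32_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += load_u16_unaligned(bytes + 2 * i);
    return sum;
}
```

With this helper, the loop in the question would be written as `sum_u16_unaligned(buffer + 1, 14)`; compilers typically turn the small fixed-size `memcpy` into a plain load on targets that support unaligned access.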

Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?

為{幸葍}努か submitted on 2019-11-26 04:55:08
Question: I have this piece of code which segfaults when run on Ubuntu 14.04 on an AMD64-compatible CPU:

    #include <inttypes.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main() {
        uint32_t sum = 0;
        uint8_t *buffer = mmap(NULL, 1<<18, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        uint16_t *p = (buffer + 1);
        int i;
        for (i=0;i<14;++i) {
            //printf("%d\n", i);
            sum += p[i];
        }
        return sum;
    }

This only segfaults if the memory is allocated using mmap. If I use malloc, a buffer on the stack, or a global