auto-vectorization

gcc auto-vectorisation (unhandled data-ref)

余生长醉 submitted on 2019-12-07 12:16:36
Question: I do not understand why the following code is not vectorized by gcc 4.4.6:

    int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
    {
        for (int i = 0; i < iSize; i++)
            pfResult[i] = pfResult[i] + pfTab[iIndex];
    }

    note: not vectorized: unhandled data-ref

However, if I write the following code:

    int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
    {
        float fTab = pfTab[iIndex];
        for (int i = 0; i < iSize; i++)
            pfResult[i] = pfResult[i] + fTab;
    }

gcc succeeds in auto-vectorizing

gcc auto-vectorisation (unhandled data-ref)

和自甴很熟 submitted on 2019-12-06 04:46:17
I do not understand why the following code is not vectorized by gcc 4.4.6:

    int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
    {
        for (int i = 0; i < iSize; i++)
            pfResult[i] = pfResult[i] + pfTab[iIndex];
    }

    note: not vectorized: unhandled data-ref

However, if I write the following code:

    int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
    {
        float fTab = pfTab[iIndex];
        for (int i = 0; i < iSize; i++)
            pfResult[i] = pfResult[i] + fTab;
    }

gcc succeeds in auto-vectorizing this loop. If I add an omp directive:

    int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
    {
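The difference between the two versions in the question comes down to a loop-invariant load: the compiler has to allow for `pfResult[i]` and `pfTab[iIndex]` referring to the same memory, so it cannot freely treat the load as constant across iterations. A minimal sketch of two ways to state the intent follows. The function names `add_scalar_hoisted` and `add_scalar_restrict` are hypothetical, and whether the `restrict` variant actually vectorizes on gcc 4.4.6 is an assumption that should be checked against the compiler's vectorizer report.

```c
#include <assert.h>
#include <stddef.h>

/* Variant 1: hoist the invariant load by hand, as in the question.
 * The loop body then reads only pfResult[i] and a local scalar. */
void add_scalar_hoisted(const float *pfTab, float *pfResult,
                        int iSize, int iIndex)
{
    assert(pfTab != NULL && pfResult != NULL);
    const float fTab = pfTab[iIndex];   /* loaded once, before the loop */
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}

/* Variant 2 (assumption): keep the load in the loop, but promise with
 * C99 restrict that the two arrays do not overlap, so the compiler is
 * allowed to hoist the load itself. */
void add_scalar_restrict(const float *restrict pfTab,
                         float *restrict pfResult,
                         int iSize, int iIndex)
{
    assert(pfTab != NULL && pfResult != NULL);
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + pfTab[iIndex];
}
```

Both variants need C99 mode (for example `-std=c99`) on older gcc releases. The two are only equivalent when the arrays really do not overlap; calling the `restrict` variant with overlapping arrays is undefined behaviour.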

How to help gcc vectorize C code

白昼怎懂夜的黑 submitted on 2019-12-04 19:58:45
Question: I have the following C code. The first part just reads in a matrix of complex numbers from standard input into a matrix called M. The interesting part is the second part.

    #include <stdio.h>
    #include <complex.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <math.h>

    int main()
    {
        int n, m, c, d;
        float re, im;
        scanf("%d %d", &n, &m);
        assert(n==m);
        complex float M[n][n];
        for(c=0; c<n; c++) {
            for(d=0; d<n; d++) {
                scanf("%f%fi", &re, &im);
                M[c][d] = re + im * I;
            }
        }
        for(c=0; c<n; c++) {
            for(d=0; d<n

How to help gcc vectorize C code

时间秒杀一切 submitted on 2019-12-03 12:52:32
I have the following C code. The first part just reads in a matrix of complex numbers from standard input into a matrix called M. The interesting part is the second part.

    #include <stdio.h>
    #include <complex.h>
    #include <stdlib.h>
    #include <assert.h>
    #include <math.h>

    int main()
    {
        int n, m, c, d;
        float re, im;
        scanf("%d %d", &n, &m);
        assert(n==m);
        complex float M[n][n];
        for(c=0; c<n; c++) {
            for(d=0; d<n; d++) {
                scanf("%f%fi", &re, &im);
                M[c][d] = re + im * I;
            }
        }
        for(c=0; c<n; c++) {
            for(d=0; d<n; d++) {
                printf("%.2f%+.2fi ", creal(M[c][d]), cimag(M[c][d]));
            }
            printf("\n");
        }
        /* Example:input 2 3

How to enable sse3 autovectorization in gcc

大兔子大兔子 submitted on 2019-11-29 11:41:06
I have a simple loop which takes the product of n complex numbers. As I perform this loop millions of times, I want it to be as fast as possible. I understand that it's possible to do this quickly using SSE3 and gcc intrinsics, but I am interested in whether it is possible to get gcc to auto-vectorize the code. Here is some sample code:

    #include <complex.h>

    complex float f(complex float x[], int n)
    {
        complex float p = 1.0;
        for (int i = 0; i < n; i++)
            p *= x[i];
        return p;
    }

The assembly you get from gcc -S -O3 -ffast-math is:

        .file "test.c"
        .section .text.unlikely,"ax",@progbits
    .LCOLDB2:
        .text
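The product loop in the question is a single dependency chain: every multiplication needs the previous value of `p`. One manual rewrite that is sometimes tried before reaching for intrinsics is to keep several independent partial products and combine them at the end. The sketch below is an illustration only; the helper name `cprod4` and the choice of four lanes are assumptions, and it is not established here that gcc turns this into SSE3 code. Reordering the multiplications can also change rounding slightly.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical helper: product of n complex floats computed as four
 * independent partial products, combined at the end. */
float complex cprod4(const float complex *x, int n)
{
    assert(n >= 0);
    float complex p0 = 1.0f, p1 = 1.0f, p2 = 1.0f, p3 = 1.0f;
    int i = 0;

    /* Main loop: four multiplications that do not depend on each other. */
    for (; i + 3 < n; i += 4) {
        p0 *= x[i];
        p1 *= x[i + 1];
        p2 *= x[i + 2];
        p3 *= x[i + 3];
    }

    /* Remainder: fold the last n % 4 elements into the first lane. */
    for (; i < n; i++)
        p0 *= x[i];

    return (p0 * p1) * (p2 * p3);
}
```

Without `-ffast-math` (or `-fcx-limited-range`), gcc may implement each complex multiplication through a library call that handles NaN and infinity cases, which is one reason the flags in the question matter.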

How to write c++ code that the compiler can efficiently compile to SSE or AVX?

僤鯓⒐⒋嵵緔 submitted on 2019-11-29 07:12:42
Let's say I have a function written in C++ that performs matrix-vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that into SIMD instructions because it does not know the alignment of the passed pointer at compile time (16-byte alignment being required for SSE, 32-byte alignment for AVX)? Or is the memory alignment of the data irrelevant for optimal SIMD code, so that data alignment only affects cache performance? If alignment is important for the generated code, how can I
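As a rough illustration of the alignment question: compilers can generally vectorize without knowing the alignment, by using unaligned loads or by peeling a few leading iterations, so alignment information mostly lets them skip that extra work. The sketch below, written in C for consistency with the other examples on this page, shows one way to pass that information along using C11 `aligned_alloc` and the GCC/Clang extension `__builtin_assume_aligned`. The names `scale_aligned` and `alloc_floats_32` are hypothetical, and the 32-byte figure is taken from the AVX case mentioned in the question.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical example: scale n floats in place. The caller promises
 * that data is 32-byte aligned; the assert checks it in debug builds. */
void scale_aligned(float *data, size_t n, float factor)
{
    assert(((uintptr_t)data % 32) == 0);
    /* GCC/Clang extension: tells the optimizer about the alignment. */
    float *p = (float *)__builtin_assume_aligned(data, 32);
    for (size_t i = 0; i < n; i++)
        p[i] *= factor;
}

/* Hypothetical helper: allocate room for n floats on a 32-byte boundary.
 * aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up. Release the result with free(). */
float *alloc_floats_32(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 31) / 32 * 32;
    if (rounded == 0)
        rounded = 32;
    return (float *)aligned_alloc(32, rounded);
}
```

If the promise made through `__builtin_assume_aligned` is false at run time, the behaviour is undefined, so the assertion is worth keeping while testing.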

Understanding gcc 4.9.2 auto-vectorization output

﹥>﹥吖頭↗ submitted on 2019-11-29 02:34:40
I am trying to learn about gcc's auto-vectorization module. After reading the documentation from here, this is what I tried (Debian jessie amd64):

    $ cat ex1.c
    int a[256], b[256], c[256];
    foo () {
      int i;

      for (i=0; i<256; i++){
        a[i] = b[i] + c[i];
      }
    }

And then, I simply run:

    $ gcc -x c -Ofast -msse2 -c -ftree-vectorize -fopt-info-vec-missed ex1.c
    ex1.c:5:3: note: misalign = 0 bytes of ref b[i_11]
    ex1.c:5:3: note: misalign = 0 bytes of ref c[i_11]
    ex1.c:5:3: note: misalign = 0 bytes of ref a[i_11]
    ex1.c:5:3: note: virtual phi. skip.
    ex1.c:5:3: note: num. args = 4 (not unary/binary/ternary op).
    ex1.c:5:3:

sum of overlapping arrays, auto-vectorization, and restrict

混江龙づ霸主 submitted on 2019-11-28 01:52:45
Ars Technica recently had an article, "Why are some programming languages faster than others". It compares Fortran and C and mentions summing arrays. In Fortran it's assumed that arrays don't overlap, which allows further optimization. In C/C++, pointers to the same type may overlap, so this optimization can't be used in general. However, in C/C++ one can use the restrict or __restrict keyword to tell the compiler to assume the pointers do not overlap. So I started looking into this with regard to auto-vectorization. The following code vectorizes in GCC and MSVC:

    void dot_int(int *a, int *b, int *c, int
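To make the role of `restrict` concrete, here is a minimal sketch of an array sum with non-overlap promised to the compiler. The function name `sum_arrays` is hypothetical and is not the truncated `dot_int` from the question. Without the qualifier, a vectorizing compiler typically either adds a run-time overlap check and keeps a scalar fallback, or gives up; with it, the check is in principle unnecessary. Which of these a given GCC or MSVC version does is best confirmed from its vectorization report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: c[i] = a[i] + b[i], with the caller promising
 * through restrict that c does not overlap a or b. */
void sum_arrays(const float *restrict a, const float *restrict b,
                float *restrict c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

The promise is the caller's responsibility: passing overlapping arrays, for example `sum_arrays(x, x + 1, x, n)`, is undefined behaviour. In C++ the keyword is not standard, which is why the question mentions the `__restrict` spelling.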

Unroll loop and do independent sum with vectorization

守給你的承諾、 submitted on 2019-11-28 01:08:22
For the following loop, GCC will only vectorize the loop if I tell it to use associative math, e.g. with -Ofast.

    float sumf(float *x)
    {
        x = (float*)__builtin_assume_aligned(x, 64);
        float sum = 0;
        for(int i=0; i<2048; i++) sum += x[i];
        return sum;
    }

Here is the assembly with -Ofast -mavx:

    sumf(float*):
        vxorps %xmm0, %xmm0, %xmm0
        leaq 8192(%rdi), %rax
    .L2:
        vaddps (%rdi), %ymm0, %ymm0
        addq $32, %rdi
        cmpq %rdi, %rax
        jne .L2
        vhaddps %ymm0, %ymm0, %ymm0
        vhaddps %ymm0, %ymm0, %ymm1
        vperm2f128 $1, %ymm1, %ymm1, %ymm0
        vaddps %ymm1, %ymm0, %ymm0
        vzeroupper
        ret

This clearly shows the loop has been
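The reason associative math is needed for `sumf` is that a vectorized sum adds the elements in a different order than the source specifies, and floating-point addition is not associative. A sketch of the "independent sums" idea from the title is to write that reordering into the source, so the compiler no longer has to be given permission for it. The function name `sumf_partial` and the choice of four accumulators are assumptions, and whether GCC then emits SIMD instructions without `-Ofast` depends on the version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum n floats with four independent accumulators.
 * The rounding order differs from a plain left-to-right sum, but it is
 * now fixed by the source rather than chosen by the optimizer. */
float sumf_partial(const float *x, size_t n)
{
    assert(n == 0 || x != NULL);
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Four additions per iteration, none depending on the others. */
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    float sum = (s0 + s1) + (s2 + s3);

    /* Remainder: the last n % 4 elements. */
    for (; i < n; i++)
        sum += x[i];

    return sum;
}
```

For the 2048-element case in the question the remainder loop never runs. With AVX, eight accumulators would match the width of one `ymm` register more closely than four.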

Understanding gcc 4.9.2 auto-vectorization output

孤街浪徒 submitted on 2019-11-27 16:49:24
Question: I am trying to learn about gcc's auto-vectorization module. After reading the documentation from here, this is what I tried (Debian jessie amd64):

    $ cat ex1.c
    int a[256], b[256], c[256];
    foo () {
      int i;

      for (i=0; i<256; i++){
        a[i] = b[i] + c[i];
      }
    }

And then, I simply run:

    $ gcc -x c -Ofast -msse2 -c -ftree-vectorize -fopt-info-vec-missed ex1.c
    ex1.c:5:3: note: misalign = 0 bytes of ref b[i_11]
    ex1.c:5:3: note: misalign = 0 bytes of ref c[i_11]
    ex1.c:5:3: note: misalign = 0 bytes of ref a[i_11]
    ex1.c:5:3