How to vectorize my loop with g++?

Posted by 痞子三分冷 on 2019-11-28 07:40:13

The -O3 flag turns on -ftree-vectorize automatically. See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html:

-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options

So in both cases the compiler is trying to do loop vectorization.

Using g++ 4.8.2 to compile with:

g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test

Gives this:

Analyzing loop at test.cpp:16

Vectorizing loop at test.cpp:16

test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39
test.cpp:16: note: created 1 versioning for alias checks.

test.cpp:16: note: LOOP VECTORIZED.
Analyzing loop at test_old.cpp:29

test.cpp:22: note: vectorized 1 loops in function.

test.cpp:18: note: Unroll loop 7 times

test.cpp:16: note: Unroll loop 7 times

test.cpp:28: note: Unroll loop 1 times

Compiling without the -ftree-vectorize flag:

g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test

Returns only this:

test_old.cpp:16: note: Unroll loop 7 times

test_old.cpp:28: note: Unroll loop 1 times

Line 16 is the start of the loop in the function, so the compiler is definitely vectorizing it. Checking the generated assembly confirms this too.

I seem to be getting some aggressive caching on the laptop I'm currently using, which makes it very hard to measure accurately how long the function takes to run.

But here are a couple of other things you can try:

  • Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.

  • Tell the compiler the arrays are aligned with __builtin_assume_aligned (a GCC builtin, so not portable).

Here's my resulting code (I removed the template, since you will want to use a different alignment for different data types):

#include <chrono>
#include <cstddef>   // size_t
#include <iostream>
#include <vector>

void foo( double * __restrict__ p1,
          double * __restrict__ p2,
          size_t start,
          size_t end )
{
  double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
  double* pA2 = static_cast<double*>(__builtin_assume_aligned(p2, 16));

  for (size_t i = start; i < end; ++i)
  {
      pA1[i] = pA1[i] - pA2[i];
      pA1[i] += 1;
  }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    std::vector<double> v,u;

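    for(size_t i=0; i<n; ++i) {
        x = static_cast<double>(i);
        y = x - 1;  // subtract in double: i - 1 would wrap around for i == 0 because size_t is unsigned
        v.push_back(x);
        u.push_back(y);
    }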

    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(&v[0], &u[0], 0, n );
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;

    return 0;
}

Like I said, I've had trouble getting consistent time measurements, so I can't confirm whether this will give you a performance increase (or maybe even a decrease!).
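If the timings jump around like mine did, one thing that may help is to run the kernel several times and keep the fastest run. This is only a rough sketch: best_of and subtract_plus_one are names I made up for illustration (they are not from the code above), and the best-of-N approach reduces noise but does not remove cache effects.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: calls fn() `repeats` times and returns the
// fastest wall-clock time in seconds. The first call doubles as a warm-up.
template <typename Fn>
double best_of(int repeats, Fn fn)
{
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = clock::now();
        fn();
        const auto t1 = clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// Same arithmetic as foo() above, repeated here so the sketch is self-contained.
void subtract_plus_one(double* __restrict__ a,
                       const double* __restrict__ b,
                       std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        a[i] = a[i] - b[i] + 1;
}
```

You would call it along the lines of best_of(10, [&] { subtract_plus_one(v.data(), u.data(), v.size()); }). Note that every repeat modifies the first array again, which is fine for timing but means the contents afterwards are not the result of a single pass.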

GCC also has vector extensions that add new primitive types which map to SIMD instructions. See the "Vector Extensions" section of the GCC manual for details.
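To give an idea of what those extensions look like, here is a minimal sketch of the same subtract-and-add-one loop written with a GCC vector type. The names v2df and sub_plus_one_v2df are my own, the 16-byte width (two doubles) is an assumption that matches SSE2, and this is GCC/Clang-specific rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// GCC vector extension: a pack of two doubles, 16 bytes wide.
typedef double v2df __attribute__((vector_size(16)));

// Hypothetical example: a[i] = a[i] - b[i] + 1, two elements at a time.
void sub_plus_one_v2df(double* a, const double* b, std::size_t n)
{
    assert(n == 0 || (a != nullptr && b != nullptr));
    const v2df one = {1.0, 1.0};
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        v2df va, vb;
        std::memcpy(&va, a + i, sizeof va);  // load without assuming alignment
        std::memcpy(&vb, b + i, sizeof vb);
        va = va - vb + one;                  // element-wise arithmetic on the pack
        std::memcpy(a + i, &va, sizeof va);  // store the result back
    }
    for (; i < n; ++i)                       // scalar tail for odd n
        a[i] = a[i] - b[i] + 1;
}
```

The memcpy calls are there so the sketch makes no alignment assumptions. Whether this beats the auto-vectorized loop is something you would have to measure.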

Most compilers say they will auto-vectorize operations, but this depends on the compiler's pattern matching, which, as you can imagine, can be very hit and miss.
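As a rough illustration of the hit-and-miss part, compare the two loops below. Both functions are examples I made up. The first has independent iterations and is the kind of loop a vectorizer usually handles. The second carries a dependency from one iteration to the next, so I would generally expect it to stay scalar, but check the vectorizer report for your own compiler version rather than taking my word for it.

```cpp
#include <cassert>
#include <cstddef>

// Independent iterations: out[i] depends only on x[i] and y[i].
void add_arrays(double* __restrict__ out,
                const double* __restrict__ x,
                const double* __restrict__ y,
                std::size_t n)
{
    assert(n == 0 || (out != nullptr && x != nullptr && y != nullptr));
    for (std::size_t i = 0; i < n; ++i)
        out[i] = x[i] + y[i];
}

// Loop-carried dependency: data[i] needs the updated data[i - 1],
// so iterations cannot simply be executed side by side.
void prefix_sum(double* data, std::size_t n)
{
    assert(n == 0 || data != nullptr);
    for (std::size_t i = 1; i < n; ++i)
        data[i] += data[i - 1];
}
```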
