AVX2 code slower than without AVX2


Such a tiny amount of work in the timed interval is hard to measure accurately. cols = 80 is only 20 __m256d vectors.

Your test program on my Skylake system bounces around between 9.53674e-07 s, 1.19209e-06 s and 0 s for the times, with the AVX2 version usually faster. (I had a _mm_pause() busy-loop running on another core to peg all the cores at max speed. It's a desktop i7-6700k so all cores share the same core clock frequency.)

gettimeofday is apparently nowhere near precise enough to measure anything that short. struct timeval uses seconds and micro-seconds, not nanoseconds. But I did fairly consistently see the AVX2 version being faster on Skylake, compiled with g++ -O3 -march=native. I don't have a Haswell to test on. My Skylake is using hardware P-state power management, so even if I didn't peg the CPU frequency ahead of time, it would ramp up to max very quickly. Haswell doesn't have that feature, so that's another reason things can be weird in yours.

If you want to measure wall-clock time (instead of core clock cycles), use std::chrono like a normal person. See Correct way of portably timing code using C++11.
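Something like this (a minimal sketch; the dummy sqrt loop is just a stand-in for your function, not your actual code):

#include <chrono>
#include <cmath>
#include <cstdio>

// Time a callable with std::chrono::steady_clock instead of gettimeofday.
template <class F>
double time_seconds(F&& f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    volatile double sink = 0;               // keep the work from optimizing away
    double secs = time_seconds([&]{
        for (int i = 0; i < 1000000; ++i)   // placeholder work, not your getData function
            sink = sink + std::sqrt((double)i);
    });
    std::printf("%g s\n", secs);
}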


Warm-up effects are going to dominate, and you're including the std::vector::resize() inside the timed interval. The two different std::vector<double> objects have to allocate memory separately, so maybe the 2nd one needs to get a new page from the OS and takes way longer. Maybe the first one was able to grab memory from the free-list, if something before main (or something in cout <<) did some temporary allocation and then shrunk or freed it.

There are many possibilities here: first, some people have reported seeing 256-bit vector instructions run slower for the first few microseconds on Haswell, like Agner Fog measured on Skylake.

Possibly the CPU decided to ramp up to max turbo during the 2nd timed interval (the AVX2 one). That takes maybe 20k clock cycles on an i7-4700MQ (2.4GHz Haswell). (Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC).

Maybe after a write system call (from cout <<) the TLB misses or branch misses hurt more for the 2nd function? (With Spectre + Meltdown mitigation enabled in your kernel, you should expect code to run slow right after returning from a system call.)

Since you didn't use -ffast-math, GCC won't have turned your scalar sqrt into an rsqrtss approximation, especially because it's double, not float. Otherwise that could explain it.


Look at how the time scales with problem size to make sure your microbenchmark is sane, and unless you're trying to measure transient / warm-up effects, repeat the work many times. If it doesn't optimize away, just slap a repeat loop around the function call inside the timed interval (instead of trying to add up times from multiple intervals). Check the compiler-generated asm, or at least check that the time scales linearly with the repeat count. You might make the function __attribute__((noinline,noclone)) to stop the optimizer from optimizing across repeat-loop iterations.
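A sketch of that pattern (the work() function here is a placeholder, not your actual code):

#include <chrono>
#include <cmath>
#include <cstdio>

// noinline,noclone keeps gcc from inlining this into the repeat loop and
// hoisting / CSEing work across iterations.
__attribute__((noinline, noclone))
double work(double x)
{
    return std::sqrt(x) * 0.5;
}

int main()
{
    const int repeats = 1000000;            // scale this and check the time scales linearly
    volatile double sink = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        sink = work((double)r);
    auto t1 = std::chrono::steady_clock::now();
    double total = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%g s total, %g s per call\n", total, total / repeats);
}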


Outside of warm-up effects, your SIMD version should be about 2x as fast as scalar on your Haswell.

Both scalar and SIMD versions bottleneck on the divide unit, even with inefficient scalar calculation of inputs before merging into a __m256d. Haswell's FP divide/sqrt hardware is only 128 bits wide (so vsqrtpd ymm is split into two 128-bit halves). But scalar is only taking advantage of half the possible throughput.

float would give you a 4x throughput boost: twice as many elements per SIMD vector, and vsqrtps (packed-single) has twice the throughput of vsqrtpd (packed-double) on Haswell. (https://agner.org/optimize/). It would also make it easier to use x * approx_rsqrt(x) as a fast approximation for sqrt(x), probably with a Newton-Raphson iteration to get up from ~12-bit precision to ~24 bits (almost as accurate as _mm256_sqrt_ps). See Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision. (If you had enough work to do in the same loop that you didn't bottleneck on divider throughput, the actual sqrt instruction can be good.)
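A sketch of that float approximation (assumes strictly positive inputs; x == 0 gives rsqrt = +inf and 0 * inf = NaN, so zero inputs would need masking):

#include <immintrin.h>

// Approximate sqrt(x) for 8 floats: vrsqrtps (~12-bit) plus one Newton-Raphson step,
// then sqrt(x) = x * rsqrt(x).  Assumes x > 0.
static inline __m256 sqrt_approx_ps(__m256 x)
{
    __m256 y = _mm256_rsqrt_ps(x);                           // ~12-bit 1/sqrt(x)
    // Newton-Raphson: y = y * 0.5 * (3 - x*y*y) refines to ~23 bits
    __m256 xyy   = _mm256_mul_ps(_mm256_mul_ps(x, y), y);
    __m256 half  = _mm256_set1_ps(0.5f);
    __m256 three = _mm256_set1_ps(3.0f);
    y = _mm256_mul_ps(_mm256_mul_ps(half, y), _mm256_sub_ps(three, xyy));
    return _mm256_mul_ps(x, y);                              // x * 1/sqrt(x) = sqrt(x)
}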

You could SIMD sqrt with float and then convert to double if you really need your output format to be double for compat with the rest of your code.
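The conversion itself is cheap; a sketch (one __m256 of 8 floats widens to two __m256d of 4 doubles each):

#include <immintrin.h>

// Widen 8 floats to 8 doubles: convert the low and high 128-bit halves separately.
static inline void cvt_8float_to_8double(__m256 v, __m256d out[2])
{
    out[0] = _mm256_cvtps_pd(_mm256_castps256_ps128(v));     // low 4 floats  -> 4 doubles
    out[1] = _mm256_cvtps_pd(_mm256_extractf128_ps(v, 1));   // high 4 floats -> 4 doubles
}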


Optimizing the stuff other than the sqrt:

This probably won't be any faster on Haswell, but it is probably more Hyperthreading-friendly if the other threads aren't using SQRT / DIV.

It uses SIMD to load and unpack the data: a<<8 + b is best done by interleaving bytes from b and a to make 16-bit integers, with _mm_unpacklo/hi_epi8. Then zero-extend to 32-bit integers so we can use SIMD int->double conversion.

This results in 4 vectors of double for each pair of __m128i of data. Using 256-bit vectors here would just introduce lane-crossing problems and require extracting down to 128 because of how _mm256_cvtepi32_pd(__m128i) works.

I changed to using _mm256_storeu_pd into the output directly, instead of hoping that gcc would optimize away the one-element-at-a-time assignment.

I also noticed that the compiler was reloading &info[0] after every store, because its alias-analysis couldn't prove that _mm256_storeu_pd was only modifying the vector data, not the control block. So I assigned the base address to a double* local variable that the compiler is sure isn't pointing to itself.

#include <immintrin.h>
#include <sys/types.h>   // u_char is a BSD/glibc typedef; use unsigned char if it's not available
#include <cstddef>       // size_t
#include <vector>

inline
__m256d cvt_scale_sqrt(__m128i vi){
    __m256d vd = _mm256_cvtepi32_pd(vi);
    vd = _mm256_mul_pd(vd, _mm256_set1_pd(1./64.));
    return _mm256_sqrt_pd(vd);
}

// assumes cols is a multiple of 16
// SIMD for everything before the multiple/sqrt as well
// but probably no speedup because this and others just bottleneck on that.
void getDataAVX2_vector_unpack(const u_char*__restrict data, size_t cols, std::vector<double>& info_vec)
{
  info_vec.resize(cols);    // TODO: hoist this out of the timed region

  double *info = &info_vec[0];  // our stores don't alias the vector control-block
                                // but gcc doesn't figure that out, so read the pointer into a local

  for (size_t i = 0; i < cols / 4; i+=4)
  {
      // 128-bit vectors because packed int->double expands to 256-bit
      __m128i a = _mm_loadu_si128((const __m128i*)&data[4 * i + cols]);   // 16 elements
      __m128i b = _mm_loadu_si128((const __m128i*)&data[4 * i + 2*cols]);
      __m128i lo16 = _mm_unpacklo_epi8(b,a);                // a<<8 | b  packed 16-bit integers
      __m128i hi16 = _mm_unpackhi_epi8(b,a);

      __m128i lo_lo = _mm_unpacklo_epi16(lo16, _mm_setzero_si128());
      __m128i lo_hi = _mm_unpackhi_epi16(lo16, _mm_setzero_si128());

      __m128i hi_lo = _mm_unpacklo_epi16(hi16, _mm_setzero_si128());
      __m128i hi_hi = _mm_unpackhi_epi16(hi16, _mm_setzero_si128());

      _mm256_storeu_pd(&info[4*(i + 0)], cvt_scale_sqrt(lo_lo));
      _mm256_storeu_pd(&info[4*(i + 1)], cvt_scale_sqrt(lo_hi));
      _mm256_storeu_pd(&info[4*(i + 2)], cvt_scale_sqrt(hi_lo));
      _mm256_storeu_pd(&info[4*(i + 3)], cvt_scale_sqrt(hi_hi));
  }
}

This compiles to a pretty nice loop on the Godbolt compiler explorer, with g++ -O3 -march=haswell.

To handle cols not being a multiple of 16, you'll need another version of the loop, or padding or something.
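For example, a scalar cleanup loop for the leftover elements (a sketch, assuming the same (a<<8 | b) / 64 then sqrt semantics as the vector loop above):

#include <cmath>
#include <cstddef>
#include <sys/types.h>   // u_char, as in the function above

// Scalar cleanup for the last cols % 16 elements.  Same layout as the vector loop:
// value = (data[j + cols] << 8) | data[j + 2*cols], then divide by 64 and sqrt.
static void scalar_tail(const u_char* data, size_t cols, double* info)
{
    for (size_t j = cols - (cols % 16); j < cols; ++j) {
        unsigned val = (unsigned(data[j + cols]) << 8) | data[j + 2*cols];
        info[j] = std::sqrt(val / 64.0);
    }
}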

But having fewer instructions other than vsqrtpd doesn't help at all with that bottleneck.

According to IACA, all the SIMD loops on Haswell bottleneck on the divider unit, 28 cycles per vsqrtpd ymm, even your original which does a large amount of scalar work. 28 cycles is a long time.

For large inputs, Skylake should be a bit more than twice as fast because of its improved divider throughput. But float would still be a ~4x speedup, or more with vrsqrtps.
