AVX2 code slower than without AVX2


Such a tiny amount of work in the timed interval is hard to measure accurately. cols = 80 is only 20 __m256d vectors.

Your test program on my Skylake system bounces around between 9.53674e-07 s, 1.19209e-06 s and 0 s for the times, with the AVX2 version usually faster. (I had a _mm_pause() busy-loop running on another core to peg all the cores at max speed. It's a desktop i7-6700k so all cores share the same core clock frequency.)

gettimeofday is apparently nowhere near precise enough to measure anything that short. struct timeval uses seconds and micro-seconds, not nanoseconds. But I did fairly consistently see the AVX2 version being faster on Skylake, compiled with g++ -O3 -march=native. I don't have a Haswell to test on. My Skylake is using hardware P-state power management, so even if I didn't peg the CPU frequency ahead of time, it would ramp up to max very quickly. Haswell doesn't have that feature, so that's another reason things can be weird in yours.

If you want to measure wall-clock time (instead of core clock cycles), use std::chrono like a normal person. See Correct way of portably timing code using C++11.
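Something like this (a minimal sketch; the dummy sqrt loop is just a stand-in for your function, not your actual code):

#include <chrono>
#include <cmath>
#include <cstdio>

// Time a callable with std::chrono::steady_clock instead of gettimeofday.
template <class F>
double time_seconds(F&& f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    volatile double sink = 0;               // keep the work from optimizing away
    double secs = time_seconds([&]{
        for (int i = 0; i < 1000000; ++i)   // placeholder work, not your getData function
            sink = sink + std::sqrt((double)i);
    });
    std::printf("%g s\n", secs);
}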


Warm-up effects are going to dominate, and you're including the std::vector::resize() inside the timed interval. The two different std::vector<double> objects have to allocate memory separately, so maybe the 2nd one needs to get a new page from the OS and takes way longer. Maybe the first one was able to grab memory from the free-list, if something before main (or something in cout <<) did some temporary allocation and then shrunk or freed it.

There are many possibilities here: first, some people have reported seeing 256-bit vector instructions run slower for the first few microseconds on Haswell, like Agner Fog measured on Skylake.

Possibly the CPU decided to ramp up to max turbo during the 2nd timed interval (the AVX2 one). That takes maybe 20k clock cycles on an i7-4700MQ (2.4GHz Haswell). (Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC).

Maybe after a write system call (from cout <<) the TLB misses or branch misses hurt more for the 2nd function? (With Spectre + Meltdown mitigation enabled in your kernel, you should expect code to run slow right after returning from a system call.)

Since you didn't use -ffast-math, GCC won't have turned your scalar sqrt into an rsqrtss approximation, especially because it's double, not float. Otherwise that could explain it.


Look at how the time scales with problem size to make sure your microbenchmark is sane, and unless you're trying to measure transient / warm-up effects, repeat the work many times. If it doesn't optimize away, just slap a repeat loop around the function call inside the timed interval (instead of trying to add up times from multiple intervals). Check the compiler-generated asm, or at least check that the time scales linearly with the repeat count. You might make the function __attribute__((noinline,noclone)) to stop the optimizer from optimizing across repeat-loop iterations.
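A sketch of that pattern (the work() function here is a placeholder, not your actual code):

#include <chrono>
#include <cmath>
#include <cstdio>

// noinline,noclone keeps gcc from inlining this into the repeat loop and
// hoisting / CSEing work across iterations.
__attribute__((noinline, noclone))
double work(double x)
{
    return std::sqrt(x) * 0.5;
}

int main()
{
    const int repeats = 1000000;            // scale this and check the time scales linearly
    volatile double sink = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        sink = work((double)r);
    auto t1 = std::chrono::steady_clock::now();
    double total = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%g s total, %g s per call\n", total, total / repeats);
}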


Outside of warm-up effects, your SIMD version should be about 2x as fast as scalar on your Haswell.

Both scalar and SIMD versions bottleneck on the divide unit, even with inefficient scalar calculation of inputs before merging into a __m256d. Haswell's FP divide/sqrt hardware is only 128 bits wide (so vsqrtpd ymm is split into two 128-bit halves). But scalar is only taking advantage of half the possible throughput.

float would give you a 4x throughput boost: twice as many elements per SIMD vector, and vsqrtps (packed-single) has twice the throughput of vsqrtpd (packed-double) on Haswell. (https://agner.org/optimize/). It would also make it easier to use x * approx_rsqrt(x) as a fast approximation for sqrt(x), probably with a Newton-Raphson iteration to get up from ~12-bit precision to ~24 bits (almost as accurate as _mm256_sqrt_ps). See Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision. (If you had enough work to do in the same loop that you didn't bottleneck on divider throughput, the actual sqrt instruction can be good.)
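A sketch of that float approximation (assumes strictly positive inputs; x == 0 gives rsqrt = +inf and 0 * inf = NaN, so zero inputs would need masking):

#include <immintrin.h>

// Approximate sqrt(x) for 8 floats: vrsqrtps (~12-bit) plus one Newton-Raphson step,
// then sqrt(x) = x * rsqrt(x).  Assumes x > 0.
static inline __m256 sqrt_approx_ps(__m256 x)
{
    __m256 y = _mm256_rsqrt_ps(x);                           // ~12-bit 1/sqrt(x)
    // Newton-Raphson: y = y * 0.5 * (3 - x*y*y) refines to ~23 bits
    __m256 xyy   = _mm256_mul_ps(_mm256_mul_ps(x, y), y);
    __m256 half  = _mm256_set1_ps(0.5f);
    __m256 three = _mm256_set1_ps(3.0f);
    y = _mm256_mul_ps(_mm256_mul_ps(half, y), _mm256_sub_ps(three, xyy));
    return _mm256_mul_ps(x, y);                              // x * 1/sqrt(x) = sqrt(x)
}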

You could SIMD sqrt with float and then convert to double if you really need your output format to be double for compat with the rest of your code.
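The conversion itself is cheap; a sketch (one __m256 of 8 floats widens to two __m256d of 4 doubles each):

#include <immintrin.h>

// Widen 8 floats to 8 doubles: convert the low and high 128-bit halves separately.
static inline void cvt_8float_to_8double(__m256 v, __m256d out[2])
{
    out[0] = _mm256_cvtps_pd(_mm256_castps256_ps128(v));     // low 4 floats  -> 4 doubles
    out[1] = _mm256_cvtps_pd(_mm256_extractf128_ps(v, 1));   // high 4 floats -> 4 doubles
}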


Optimizing the stuff other than the sqrt:

This probably won't be any faster on Haswell, but it is probably more Hyperthreading-friendly if the other threads aren't using SQRT / DIV.

It uses SIMD to load and unpack the data: a<<8 + b is best done by interleaving bytes from b and a to make 16-bit integers, with _mm_unpacklo/hi_epi8. Then zero-extend to 32-bit integers so we can use SIMD int->double conversion.

This results in 4 vectors of double for each pair of __m128i of data. Using 256-bit vectors here would just introduce lane-crossing problems and require extracting down to 128 because of how _mm256_cvtepi32_pd(__m128i) works.

I changed to using _mm256_storeu_pd into the output directly, instead of hoping that gcc would optimize away the one-element-at-a-time assignment.

I also noticed that the compiler was reloading &info[0] after every store, because its alias-analysis couldn't prove that _mm256_storeu_pd was only modifying the vector data, not the control block. So I assigned the base address to a double* local variable that the compiler is sure isn't pointing to itself.

#include <immintrin.h>
#include <sys/types.h>   // u_char is a BSD/glibc typedef; use unsigned char if it's not available
#include <cstddef>       // size_t
#include <vector>

inline
__m256d cvt_scale_sqrt(__m128i vi){
    __m256d vd = _mm256_cvtepi32_pd(vi);
    vd = _mm256_mul_pd(vd, _mm256_set1_pd(1./64.));
    return _mm256_sqrt_pd(vd);
}

// assumes cols is a multiple of 16
// SIMD for everything before the multiple/sqrt as well
// but probably no speedup because this and others just bottleneck on that.
void getDataAVX2_vector_unpack(const u_char*__restrict data, size_t cols, std::vector<double>& info_vec)
{
  info_vec.resize(cols);    // TODO: hoist this out of the timed region

  double *info = &info_vec[0];  // our stores don't alias the vector control-block
                                // but gcc doesn't figure that out, so read the pointer into a local

  for (size_t i = 0; i < cols / 4; i+=4)
  {
      // 128-bit vectors because packed int->double expands to 256-bit
      __m128i a = _mm_loadu_si128((const __m128i*)&data[4 * i + cols]);   // 16 elements
      __m128i b = _mm_loadu_si128((const __m128i*)&data[4 * i + 2*cols]);
      __m128i lo16 = _mm_unpacklo_epi8(b,a);                // a<<8 | b  packed 16-bit integers
      __m128i hi16 = _mm_unpackhi_epi8(b,a);

      __m128i lo_lo = _mm_unpacklo_epi16(lo16, _mm_setzero_si128());
      __m128i lo_hi = _mm_unpackhi_epi16(lo16, _mm_setzero_si128());

      __m128i hi_lo = _mm_unpacklo_epi16(hi16, _mm_setzero_si128());
      __m128i hi_hi = _mm_unpackhi_epi16(hi16, _mm_setzero_si128());

      _mm256_storeu_pd(&info[4*(i + 0)], cvt_scale_sqrt(lo_lo));
      _mm256_storeu_pd(&info[4*(i + 1)], cvt_scale_sqrt(lo_hi));
      _mm256_storeu_pd(&info[4*(i + 2)], cvt_scale_sqrt(hi_lo));
      _mm256_storeu_pd(&info[4*(i + 3)], cvt_scale_sqrt(hi_hi));
  }
}

This compiles to a pretty nice loop on the Godbolt compiler explorer, with g++ -O3 -march=haswell.

To handle cols not being a multiple of 16, you'll need another version of the loop, or padding or something.
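For example, a scalar cleanup loop for the leftover elements (a sketch, assuming the same (a<<8 | b) / 64 then sqrt semantics as the vector loop above):

#include <cmath>
#include <cstddef>
#include <sys/types.h>   // u_char, as in the function above

// Scalar cleanup for the last cols % 16 elements.  Same layout as the vector loop:
// value = (data[j + cols] << 8) | data[j + 2*cols], then divide by 64 and sqrt.
static void scalar_tail(const u_char* data, size_t cols, double* info)
{
    for (size_t j = cols - (cols % 16); j < cols; ++j) {
        unsigned val = (unsigned(data[j + cols]) << 8) | data[j + 2*cols];
        info[j] = std::sqrt(val / 64.0);
    }
}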

But having fewer instructions other than vsqrtpd doesn't help at all with that bottleneck.

According to IACA, all the SIMD loops on Haswell bottleneck on the divider unit, 28 cycles per vsqrtpd ymm, even your original which does a large amount of scalar work. 28 cycles is a long time.

For large inputs, Skylake should be a bit more than twice as fast because of its improved divider throughput. But float would still be a ~4x speedup, or more with vrsqrtps.
