How to solve the 32-byte-alignment issue for AVX load/store operations?

问题

I am having alignment issue while using ymm registers, with some snippets of code that seems fine to me. Here is a minimal working example:

#include <iostream> 
#include <immintrin.h>

inline void ones(float *a)
{
     __m256 out_aligned = _mm256_set1_ps(1.0f);
     _mm256_store_ps(a,out_aligned);
}

int main()
{
     size_t ss = 8;
     float *a = new float[ss];
     ones(a);

     delete [] a;

     std::cout << \"All Good!\" << std::endl;
     return 0;
}

Certainly, sizeof(float) is 4 on my architecture (Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz) and I\'m compiling with gcc using -O3 -march=native flags. Of course the error goes away with unaligned memory access i.e. specifying _mm256_storeu_ps. I also do not have this problem on xmm registers, i.e.

inline void ones_sse(float *a)
{
     __m128 out_aligned = _mm_set1_ps(1.0f);
     _mm_store_ps(a,out_aligned);
}

Am I doing anything foolish? what is the work-around for this?

回答1:

The standard allocators normally only align to alignof(max_align_t), which is often 16B, e.g. long double in the x86-64 System V ABI. But in some 32-bit ABIs it's only 8B, so it's not even sufficient for dynamic allocation of aligned __m128 vectors and you'll need to go beyond simply calling new or malloc.

Static and automatic storage are easy: use alignas(32) float arr[N];

C++17 provides aligned new for aligned dynamic allocation that's compatible with delete:
float * arr = new (std::align_val_t(32)) float[numSteps];
See documentation for new/new[] and std::align_val_t

Other options for dynamic allocation are mostly compatible with malloc/free, not new/delete:

std::aligned_alloc: ISO C++17. major downside: size must be a multiple of alignment. This braindead requirement makes it inappropriate for allocating a 64B cache-line aligned array of an unknown number of floats, for example. Or especially a 2M-aligned array to take advantage of transparent hugepages.

The C version of aligned_alloc was added in ISO C11. It's available in some but not all C++ compilers. As noted on the cppreference page, the C11 version wasn't required to fail when size isn't a multiple of alignment (it's undefined behaviour), so many implementations provided the obvious desired behaviour as an "extension". Discussion is underway to fix this, but for now I can't really recommend aligned_alloc as a portable way to allocate arbitrary-sized arrays.

Also, commenters report it's unavailable in MSVC++. See best cross-platform method to get aligned memory for a viable #ifdef for Windows. But AFAIK there are no Windows aligned-allocation functions that produce pointers compatible with standard free.
posix_memalign: Part of POSIX 2001, not any ISO C or C++ standard. Clunky prototype/interface compared to aligned_alloc. I've seen gcc generate reloads of the pointer because it wasn't sure that stores into the buffer didn't modify the pointer. (Since posix_memalign is passed the address of the pointer.) So if you use this, copy the pointer into another C++ variable that hasn't had its address passed outside the function.

#include <stdlib.h>
int posix_memalign(void **memptr, size_t alignment, size_t size);  // POSIX 2001
void *aligned_alloc(size_t alignment, size_t size);                // C11 (and ISO C++17)

_mm_malloc: Available on any platform where _mm_whatever_ps is available, but you can't pass pointers from it to free. On many C and C++ implementations _mm_free and free are compatible, but it's not guaranteed to be portable. (And unlike the other two, it will fail at run-time, not compile time.) On MSVC on Windows, _mm_malloc uses _aligned_malloc, which is not compatible with free; it crashes in practice.

In C++11 and later: use alignas(32) float avx_array[1234] as the first member of a struct/class member (or on a plain array directly) so static and automatic storage objects of that type will have 32B alignment. std::aligned_storage documentation has an example of this technique to explain what std::aligned_storage does.

This doesn't actually work for dynamically-allocated storage (like a std::vector<my_class_with_aligned_member_array>), see Making std::vector allocate aligned memory.

In C++17, there might be a way to use aligned new for std::vector. TODO: find out how.

And finally, the last option is so bad it's not even part of the list: allocate a larger buffer and add do p+=31; p&=~31ULL with appropriate casting. Too many drawbacks (hard to free, wastes memory) to be worth discussing, since aligned-allocation functions are available on every platform that support Intel _mm256 intrinsics. But there are even library functions that will help you do this, IIRC.

The requirement to use _mm_free instead of free probably exists to for the possibility of implementing _mm_malloc on top of a plain old malloc using this technique.

回答2:

There are the two intrinsics for memory management. _mm_malloc operates like a standard malloc, but it takes an additional parameter that specifies the desired alignment. In this case, a 32 byte alignment. When this allocation method is used, memory must be freed by the corresponding _mm_free call.

float *a = static_cast<float*>(_mm_malloc(sizeof(float) * ss , 32));
...
_mm_free(a);

回答3:

You'll need aligned allocators.

But there isn't a reason you can't bundle them up:

template<class T, size_t align>
struct aligned_free {
  void operator()(T* t)const{
    ASSERT(!(uint_ptr(t) % align));
    _mm_free(t);
  }
  aligned_free() = default;
  aligned_free(aligned_free const&) = default;
  aligned_free(aligned_free&&) = default;
  // allow assignment from things that are
  // more aligned than we are:
  template<size_t o,
    std::enable_if_t< !(o % align) >* = nullptr
  >
  aligned_free( aligned_free<T, o> ) {}
};
template<class T>
struct aligned_free<T[]>:aligned_free<T>{};

template<class T, size_t align=1>
using mm_ptr = std::unique_ptr< T, aligned_free<T, align> >;
template<class T, size_t align>
struct aligned_make;
template<class T, size_t align>
struct aligned_make<T[],align> {
  mm_ptr<T, align> operator()(size_t N)const {
    return mm_ptr<T, align>(static_cast<T*>(_mm_malloc(sizeof(T)*N, align)));
  }
};
template<class T, size_t align>
struct aligned_make {
  mm_ptr<T, align> operator()()const {
    return aligned_make<T[],align>{}(1);
  }
};
template<class T, size_t N, size_t align>
struct aligned_make<T[N], align> {
  mm_ptr<T, align> operator()()const {
    return aligned_make<T[],align>{}(N);
  }
}:
// T[N] and T versions:
template<class T, size_t align>
auto make_aligned()
-> std::result_of_t<aligned_make<T,align>()>
{
  return aligned_make<T,align>{}();
}
// T[] version:
template<class T, size_t align>
auto make_aligned(size_t N)
-> std::result_of_t<aligned_make<T,align>(size_t)>
{
  return aligned_make<T,align>{}(N);
}

now mm_ptr<float[], 4> is a unique pointer to an array of floats that is 4 byte aligned. You create it via make_aligned<float[], 4>(20), which creates 20 floats 4-byte aligned, or make_aligned<float[20], 4>() (compile-time constant only in that syntax). make_aligned<float[20],4> returns mm_ptr<float[],4> not mm_ptr<float[20],4>.

A mm_ptr<float[], 8> can move-construct a mm_ptr<float[],4> but not vice-versa, which I think is nice.

mm_ptr<float[]> can take any alignment, but guarantees none.

Overhead, like with a std::unique_ptr, is basically zero per pointer. Code overhead can be minimized by aggressive inlineing.

来源：https://stackoverflow.com/questions/32612190/how-to-solve-the-32-byte-alignment-issue-for-avx-load-store-operations

标签

c++