Question
I am having an alignment issue while using ymm registers, with some snippets of code that seem fine to me. Here is a minimal working example:
#include <iostream>
#include <immintrin.h>
inline void ones(float *a)
{
__m256 out_aligned = _mm256_set1_ps(1.0f);
_mm256_store_ps(a,out_aligned);
}
int main()
{
size_t ss = 8;
float *a = new float[ss];
ones(a);
delete [] a;
std::cout << "All Good!" << std::endl;
return 0;
}
Certainly, sizeof(float) is 4 on my architecture (Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz) and I'm compiling with gcc using the -O3 -march=native flags. Of course the error goes away with unaligned memory access, i.e. specifying _mm256_storeu_ps. I also do not have this problem on xmm registers, i.e.
inline void ones_sse(float *a)
{
__m128 out_aligned = _mm_set1_ps(1.0f);
_mm_store_ps(a,out_aligned);
}
Am I doing anything foolish? What is the work-around for this?
Answer 1:
The standard allocators normally only align to alignof(max_align_t), which is often 16B, e.g. long double in the x86-64 System V ABI. But in some 32-bit ABIs it's only 8B, so it's not even sufficient for dynamic allocation of aligned __m128 vectors and you'll need to go beyond simply calling new or malloc.
Static and automatic storage are easy: use alignas(32) float arr[N];
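A minimal sketch of that, with a run-time check of the resulting addresses. The names is_aligned, Avx8, static_buf and demo_alignas are made up for illustration and are not part of the question:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical wrapper type: every static or automatic object of this
// type starts on a 32-byte boundary.
struct Avx8 {
    alignas(32) float v[8];
};

static_assert(alignof(Avx8) == 32, "alignas(32) raises the type's alignment");

// Static storage: over-aligned plain array at namespace scope.
alignas(32) float static_buf[8];

bool demo_alignas() {
    alignas(32) float local[8] = {};   // automatic storage, plain array
    Avx8 wrapped = {};                 // automatic storage, wrapper type
    return is_aligned(static_buf, 32) && is_aligned(local, 32) &&
           is_aligned(&wrapped, 32);
}
```

Pointers to any of these three objects are safe to hand to _mm256_store_ps, since the 32-byte alignment comes from the declaration itself rather than from the allocator.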
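C++17 provides aligned new for aligned dynamic allocation:
float * arr = new (std::align_val_t(32)) float[numSteps];
Memory obtained this way should be released through the matching aligned deallocation function, ::operator delete[](arr, std::align_val_t(32)). A plain delete[] expression on a float* selects the unaligned operator delete[], because float itself is not an over-aligned type. If the element type is itself declared with alignas(32), ordinary new[] and delete[] expressions pick the aligned overloads automatically.
See documentation for new/new[] and std::align_val_t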
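As a rough sketch of C++17 aligned new (Float8, alloc_floats_32, free_floats_32 and demo_aligned_new are hypothetical names), the allocation and deallocation functions can be called directly so that they are guaranteed to match, or the element type can carry the alignment itself:

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical over-aligned block of 8 floats (one ymm register's worth).
struct alignas(32) Float8 {
    float v[8];
};

// Raw form: call the aligned allocation function directly and pair it
// with the aligned deallocation function taking the same alignment.
float* alloc_floats_32(std::size_t count) {
    void* p = ::operator new[](count * sizeof(float), std::align_val_t(32));
    return static_cast<float*>(p);
}

void free_floats_32(float* p) noexcept {
    ::operator delete[](p, std::align_val_t(32));
}

bool demo_aligned_new() {
    float* raw = alloc_floats_32(100);
    bool ok = reinterpret_cast<std::uintptr_t>(raw) % 32 == 0;
    free_floats_32(raw);

    // Typed form: Float8 is over-aligned, so the plain new[] / delete[]
    // expressions select the aligned overloads on their own.
    Float8* blocks = new Float8[4];
    ok = ok && reinterpret_cast<std::uintptr_t>(blocks) % 32 == 0;
    delete[] blocks;
    return ok;
}
```

This needs a C++17 compiler and standard library. The raw form leaves the floats uninitialized.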
Other options for dynamic allocation are mostly compatible with malloc/free, not new/delete:
std::aligned_alloc: ISO C++17. Major downside: size must be a multiple of alignment. This braindead requirement makes it inappropriate for allocating a 64B cache-line aligned array of an unknown number of floats, for example. Or especially a 2M-aligned array to take advantage of transparent hugepages.
The C version of aligned_alloc was added in ISO C11. It's available in some but not all C++ compilers. As noted on the cppreference page, the C11 version wasn't required to fail when size isn't a multiple of alignment (it's undefined behaviour), so many implementations provided the obvious desired behaviour as an "extension". Discussion is underway to fix this, but for now I can't really recommend aligned_alloc as a portable way to allocate arbitrary-sized arrays.
Also, commenters report it's unavailable in MSVC++. See best cross-platform method to get aligned memory for a viable #ifdef for Windows. But AFAIK there are no Windows aligned-allocation functions that produce pointers compatible with standard free.
posix_memalign: Part of POSIX 2001, not any ISO C or C++ standard. Clunky prototype/interface compared to aligned_alloc. I've seen gcc generate reloads of the pointer because it wasn't sure that stores into the buffer didn't modify the pointer. (Since posix_memalign is passed the address of the pointer.) So if you use this, copy the pointer into another C++ variable that hasn't had its address passed outside the function.
#include <stdlib.h>
int posix_memalign(void **memptr, size_t alignment, size_t size); // POSIX 2001
void *aligned_alloc(size_t alignment, size_t size); // C11 (and ISO C++17)
_mm_malloc: Available on any platform where _mm_whatever_ps is available, but you can't pass pointers from it to free. On many C and C++ implementations _mm_free and free are compatible, but it's not guaranteed to be portable. (And unlike the other two, it will fail at run-time, not compile time.) On MSVC on Windows, _mm_malloc uses _aligned_malloc, which is not compatible with free; it crashes in practice.
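To make the first two options concrete, here is a small sketch assuming a POSIX system with a C++17 standard library. The helper names round_up, alloc_floats_aligned_alloc and alloc_floats_posix are invented for this example. It pads the request so that aligned_alloc's size rule is met, and copies the posix_memalign result into a fresh variable as suggested above:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>   // std::aligned_alloc, std::free (C++17)
#include <stdlib.h>  // posix_memalign (POSIX only)

// Round `bytes` up to the next multiple of `alignment`, which must be a
// power of two, to satisfy aligned_alloc's size requirement.
inline std::size_t round_up(std::size_t bytes, std::size_t alignment) {
    return (bytes + alignment - 1) & ~(alignment - 1);
}

// aligned_alloc variant: pad the request; release with std::free.
float* alloc_floats_aligned_alloc(std::size_t count, std::size_t alignment) {
    std::size_t bytes = round_up(count * sizeof(float), alignment);
    return static_cast<float*>(std::aligned_alloc(alignment, bytes));
}

// posix_memalign variant: no size restriction; release with free.
// The alignment must be a power of two and a multiple of sizeof(void*).
float* alloc_floats_posix(std::size_t count, std::size_t alignment) {
    void* tmp = nullptr;
    if (posix_memalign(&tmp, alignment, count * sizeof(float)) != 0)
        return nullptr;
    // Copy into a variable whose address never escaped this function.
    float* p = static_cast<float*>(tmp);
    return p;
}
```

Both functions return uninitialized memory and report failure with a null pointer.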
In C++11 and later: use alignas(32) float avx_array[1234] as the first member of a struct/class (or on a plain array directly) so static and automatic storage objects of that type will have 32B alignment. The std::aligned_storage documentation has an example of this technique to explain what std::aligned_storage does.
This doesn't actually work for dynamically-allocated storage (like a std::vector<my_class_with_aligned_member_array>), see Making std::vector allocate aligned memory.
In C++17, there might be a way to use aligned new for std::vector. TODO: find out how.
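One possible way to fill in that TODO is a small custom allocator that forwards to the C++17 aligned operator new. This is only a sketch: AlignedAllocator is a hypothetical name, not a standard or library type.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical minimal allocator: every block it hands out starts on an
// Align-byte boundary.
template <class T, std::size_t Align>
struct AlignedAllocator {
    static_assert((Align & (Align - 1)) == 0, "Align must be a power of two");
    static_assert(Align >= alignof(T), "Align must not weaken T's alignment");

    using value_type = T;

    AlignedAllocator() noexcept = default;
    template <class U>
    AlignedAllocator(const AlignedAllocator<U, Align>&) noexcept {}

    // Needed because Align is a non-type template parameter, so the
    // default rebind in std::allocator_traits does not apply.
    template <class U>
    struct rebind { using other = AlignedAllocator<U, Align>; };

    T* allocate(std::size_t n) {
        // No overflow check on n * sizeof(T) in this sketch.
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t(Align)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(Align));
    }
};

// The allocator is stateless, so any two instances compare equal.
template <class T, class U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept { return true; }
template <class T, class U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept { return false; }
```

With that in place, std::vector<float, AlignedAllocator<float, 32>> v(n); yields a v.data() that is 32-byte aligned, and it stays aligned across reallocation because every block comes from the same allocator.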
And finally, the last option is so bad it's not even part of the list: allocate a larger buffer and do p+=31; p&=~31ULL with appropriate casting. Too many drawbacks (hard to free, wastes memory) to be worth discussing, since aligned-allocation functions are available on every platform that supports Intel _mm256 intrinsics. But there are even library functions that will help you do this, IIRC.
The requirement to use _mm_free instead of free probably exists to allow for the possibility of implementing _mm_malloc on top of a plain old malloc using this technique.
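For completeness, here is what the over-allocate-and-round-up technique looks like with std::align from <memory> doing the pointer arithmetic. Whether std::align is the library function meant above is an assumption. ManualAligned and alloc_floats_manual are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Hypothetical pair of pointers: `raw` is what malloc returned and is the
// one that must be passed to std::free; `data` is the aligned view.
struct ManualAligned {
    void*  raw;
    float* data;
};

ManualAligned alloc_floats_manual(std::size_t count, std::size_t alignment) {
    std::size_t bytes = count * sizeof(float);
    std::size_t space = bytes + alignment - 1;   // worst-case padding
    void* raw = std::malloc(space);
    if (!raw) return {nullptr, nullptr};
    void* p = raw;
    // std::align bumps p up to the next multiple of `alignment` and
    // shrinks `space` by the number of bytes skipped.
    void* aligned = std::align(alignment, bytes, p, space);
    return {raw, static_cast<float*>(aligned)};
}
```

The need to carry both pointers around is exactly the "hard to free" drawback: calling std::free on data instead of raw is undefined behaviour whenever padding was inserted.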
Answer 2:
There are two intrinsics for memory management. _mm_malloc operates like a standard malloc, but it takes an additional parameter that specifies the desired alignment. In this case, a 32 byte alignment. When this allocation method is used, memory must be freed by the corresponding _mm_free call.
float *a = static_cast<float*>(_mm_malloc(sizeof(float) * ss , 32));
...
_mm_free(a);
Answer 3:
You'll need aligned allocators.
But there isn't a reason you can't bundle them up:
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <type_traits>
#include <immintrin.h> // _mm_malloc, _mm_free
template<class T, std::size_t align>
struct aligned_free {
void operator()(T* t)const{
assert(reinterpret_cast<std::uintptr_t>(t) % align == 0);
_mm_free(t);
}
aligned_free() = default;
aligned_free(aligned_free const&) = default;
aligned_free(aligned_free&&) = default;
// keep the deleter assignable so mm_ptr can be move-assigned:
aligned_free& operator=(aligned_free const&) = default;
// allow assignment from things that are
// more aligned than we are:
template<std::size_t o,
std::enable_if_t< !(o % align) >* = nullptr
>
aligned_free( aligned_free<T, o> ) {}
};
// array form: same deleter, inheriting the converting constructor
template<class T, std::size_t align>
struct aligned_free<T[], align>:aligned_free<T, align>{
using aligned_free<T, align>::aligned_free;
};
template<class T, std::size_t align=1>
using mm_ptr = std::unique_ptr< T, aligned_free<T, align> >;
template<class T, std::size_t align>
struct aligned_make;
// T[]: run-time element count; the memory is left uninitialized
template<class T, std::size_t align>
struct aligned_make<T[],align> {
mm_ptr<T[], align> operator()(std::size_t N)const {
return mm_ptr<T[], align>(static_cast<T*>(_mm_malloc(sizeof(T)*N, align)));
}
};
// single object
template<class T, std::size_t align>
struct aligned_make {
mm_ptr<T, align> operator()()const {
return mm_ptr<T, align>(static_cast<T*>(_mm_malloc(sizeof(T), align)));
}
};
// T[N]: compile-time element count, still returns mm_ptr<T[], align>
template<class T, std::size_t N, std::size_t align>
struct aligned_make<T[N], align> {
mm_ptr<T[], align> operator()()const {
return aligned_make<T[],align>{}(N);
}
};
// T[N] and T versions:
template<class T, std::size_t align>
auto make_aligned()
-> decltype(aligned_make<T,align>{}())
{
return aligned_make<T,align>{}();
}
// T[] version:
template<class T, std::size_t align>
auto make_aligned(std::size_t N)
-> decltype(aligned_make<T,align>{}(N))
{
return aligned_make<T,align>{}(N);
}
Now mm_ptr<float[], 4> is a unique pointer to an array of floats that is 4-byte aligned. You create it via make_aligned<float[], 4>(20), which creates 20 floats 4-byte aligned, or make_aligned<float[20], 4>() (compile-time constant only in that syntax). make_aligned<float[20],4> returns mm_ptr<float[],4>, not mm_ptr<float[20],4>.
A mm_ptr<float[], 8> can move-construct a mm_ptr<float[],4> but not vice-versa, which I think is nice.
mm_ptr<float[]> can take any alignment, but guarantees none.
Overhead, like with a std::unique_ptr, is basically zero per pointer. Code overhead can be minimized by aggressive inlining.
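If depending on _mm_malloc is undesirable, the same owning-pointer idea can be sketched with only the C++17 standard library. This is a simplified variant restricted to float, and it is not the code above. AlignedDelete, aligned_floats and make_aligned_floats are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <utility>

// Hypothetical deleter: releases memory obtained from the aligned
// operator new[] with the same alignment.
template <std::size_t Align>
struct AlignedDelete {
    void operator()(float* p) const noexcept {
        ::operator delete[](p, std::align_val_t(Align));
    }
};

template <std::size_t Align>
using aligned_floats = std::unique_ptr<float[], AlignedDelete<Align>>;

// Allocates `count` uninitialized floats on an Align-byte boundary.
template <std::size_t Align>
aligned_floats<Align> make_aligned_floats(std::size_t count) {
    void* p = ::operator new[](count * sizeof(float), std::align_val_t(Align));
    return aligned_floats<Align>(static_cast<float*>(p));
}
```

Usage would look like auto buf = make_aligned_floats<32>(1024); followed by _mm256_store_ps(buf.get(), v);. Because the deleter is an empty class, the pointer is typically no larger than a raw float*.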
Source: https://stackoverflow.com/questions/32612190/how-to-solve-the-32-byte-alignment-issue-for-avx-load-store-operations