My first attempt at vectorization intrinsics was with SSE, where there is basically only one data type, __m128i. Switching to Neon I found the data types a
Since the initially proposed method has undefined behaviour in C++, I implemented something like this:
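#include <arm_neon.h>
#include <cstring>                  // std::memcpy
#include <boost/static_assert.hpp>  // BOOST_STATIC_ASSERT_MSG

// Wraps a NEON vector and converts to/from any type of the same size.
template <typename T>
struct NeonVectorType {
private:
    T data;
public:
    // Convert to any type U of the same size by copying the raw bytes.
    template <typename U>
    operator U () const {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T), "Trying to convert to data type of different size");
        U u;
        std::memcpy(&u, &data, sizeof u);
        return u;
    }
    // Assign from any type U of the same size by copying the raw bytes.
    template <typename U>
    NeonVectorType<T>& operator =(const U& in) {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T), "Trying to copy from data type of different size");
        std::memcpy(&data, &in, sizeof data);
        return *this;
    }
};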
Then:
typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.
The use of memcpy is discussed here (and here); it avoids breaking the strict aliasing rule. Note that the memcpy call generally gets optimized away.
If you look at the edit history, I had implemented a custom version with combine operators for vectors of vectors (e.g. uint8x8x2_t). The problem was mentioned here. However, since those data types are declared as arrays (see guide, section 12.2.2) and are therefore located in consecutive memory locations, the compiler is bound to treat the memcpy correctly.
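To illustrate the memcpy-based conversion without needing an ARM toolchain, here is a minimal sketch that uses plain structs as stand-ins for the NEON types. The names U8x8, U8x8x2, U8x16, bit_copy and make_pair_0_to_15 are hypothetical and not part of arm_neon.h; the real types are assumed to have the same contiguous layout.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-ins for NEON types so the sketch builds on any platform.
struct U8x8   { std::uint8_t lanes[8]; };   // plays the role of uint8x8_t
struct U8x8x2 { U8x8 val[2]; };             // plays the role of uint8x8x2_t
struct U8x16  { std::uint8_t lanes[16]; };  // plays the role of uint8x16_t

// Hypothetical helper: copy the object representation of `from` into a new To.
template <typename To, typename From>
To bit_copy(const From& from) {
    static_assert(sizeof(To) == sizeof(From), "size mismatch");
    static_assert(std::is_trivially_copyable<To>::value &&
                  std::is_trivially_copyable<From>::value,
                  "memcpy needs trivially copyable types");
    To to;
    std::memcpy(&to, &from, sizeof to);
    return to;
}

// Build a pair of 8-lane vectors holding 0..7 and 8..15.
inline U8x8x2 make_pair_0_to_15() {
    U8x8x2 p;
    for (int i = 0; i < 8; ++i) {
        p.val[0].lanes[i] = static_cast<std::uint8_t>(i);
        p.val[1].lanes[i] = static_cast<std::uint8_t>(8 + i);
    }
    return p;
}
```

Because the two halves of the pair sit in consecutive memory, bit_copy<U8x16>(make_pair_0_to_15()) yields lanes 0 through 15 in order.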
Finally, to print the content of the variable one could use a function like this.
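As one possible shape for such a function (a sketch, not necessarily the one originally linked), the raw bytes can be copied out with memcpy and formatted as hex. The name to_hex_bytes is hypothetical, and the helper assumes the value is trivially copyable.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: format the raw bytes of any trivially copyable value
// (for instance a NEON vector) as space-separated hex, lowest address first.
template <typename V>
std::string to_hex_bytes(const V& v) {
    unsigned char bytes[sizeof(V)];
    std::memcpy(bytes, &v, sizeof(V));   // copy bytes out without aliasing issues
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < sizeof(V); ++i) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(bytes[i]));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

On a NEON target this could be called as std::puts(to_hex_bytes(v).c_str()) for a vector v; the byte order shown is memory order, so multi-byte lanes appear in the target's endianness.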
If you try to avoid casting in a sensible way through various data-structure hackery, you'll end up shuffling memory and words around, which will kill any performance you're hoping to get from NEON.
You can probably cast quad registers down to double registers easily, but the other way might not be possible.
Everything boils down to this: each instruction has a few bits to index registers. If an instruction expects quad registers, it treats the double registers in pairs, D(2*n) and D(2*n+1), which together form Q(n); only one index is encoded in the instruction, and D(2*n+1) is implicit for the core. If at any point in your code you try to cast two doubles into a quad, those two may not be in consecutive registers, forcing the compiler to shuffle registers out to the stack and back to get a consecutive layout.
I think it is still the same answer in different words: https://stackoverflow.com/a/13734838/1163019
NEON instructions are designed for streaming: you load from memory in big chunks, process the data, then store what you want back. The mechanics should all be very simple. If they are not, you'll lose the extra performance NEON offers, and people will ask why you're making life harder for yourself by using it in the first place.
Think of NEON as immutable value types and operations.
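A minimal sketch of that load, process, store pattern is below. The function name add_u8 is hypothetical; the NEON path uses the vld1q_u8, vaddq_u8 and vst1q_u8 intrinsics and is only compiled when __ARM_NEON is defined, with a scalar loop covering the tail and non-NEON builds.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical example: dst[i] = a[i] + b[i] (wrapping modulo 256) for n bytes.
void add_u8(const std::uint8_t* a, const std::uint8_t* b,
            std::uint8_t* dst, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 16 <= n; i += 16) {
        const uint8x16_t va  = vld1q_u8(a + i);   // load 16 bytes
        const uint8x16_t vb  = vld1q_u8(b + i);   // load 16 bytes
        const uint8x16_t sum = vaddq_u8(va, vb);  // new value; inputs untouched
        vst1q_u8(dst + i, sum);                   // store 16 bytes
    }
#endif
    // Scalar tail (and the whole loop on non-NEON targets).
    for (; i < n; ++i) {
        dst[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    }
}
```

Each vector is loaded once, never reinterpreted or modified in place, and the result is a fresh value that goes straight back to memory, which is the "immutable value" style described above.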
According to the C++ Standard, this data type is nearly useless (and certainly so for the purpose you intend), because reading from an inactive member of a union is undefined behavior.
Your compiler may promise to make this work, but since you haven't asked about any particular compiler, it is impossible to comment further on that.