Using a union (encapsulated in a struct) to bypass conversions for NEON data types

Submitted by 核能气质少年 on 2019-11-27 15:53:17

According to the C++ Standard, this data type is nearly useless (and certainly so for the purpose you intend). That's because reading from an inactive member of a union is undefined behavior.

It is possible that your compiler promises to make this work. But you haven't asked about any particular compiler, so it is impossible to comment further on that.

Antonio

Since the initial proposed method has undefined behaviour in C++, I have implemented something like this:

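#include <arm_neon.h>              // NEON vector types such as uint8x16_t
#include <boost/static_assert.hpp> // BOOST_STATIC_ASSERT_MSG
#include <cstring>                 // std::memcpy

template <typename T>
struct NeonVectorType {

    private:
    T data;

    public:
    // Convert to any type of the same size by copying the raw bytes.
    template <typename U>
    operator U () const {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T),"Trying to convert to data type of different size");
        U u;
        std::memcpy( &u, &data, sizeof u );
        return u;
    }

    // Assign from any type of the same size by copying the raw bytes.
    template <typename U>
    NeonVectorType<T>& operator =(const U& in) {
        BOOST_STATIC_ASSERT_MSG(sizeof(U) == sizeof(T),"Trying to copy from data type of different size");
        std::memcpy( &data, &in, sizeof data );
        return *this;
    }

};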

Then:

typedef NeonVectorType<uint8x16_t> uint_128bit_t; //suitable for uint8x16_t, uint8x8x2_t, uint32x4_t, etc.
typedef NeonVectorType<uint8x8_t> uint_64bit_t; //suitable for uint8x8_t, uint32x2_t, etc.

The use of memcpy is discussed here (and here), and avoids breaking the strict aliasing rule. Note that the compiler generally optimizes the memcpy call away.
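To make the memcpy idea concrete without needing an ARM toolchain, here is a minimal sketch. FakeU8x16 and FakeU8x8x2 are hypothetical stand-ins for uint8x16_t and uint8x8x2_t, and bit_copy and round_trip_ok are names made up for this illustration; none of them come from the answer above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins so the sketch builds without <arm_neon.h>.
using FakeU8x16 = std::array<std::uint8_t, 16>;  // plays the role of uint8x16_t
struct FakeU8x8x2 { std::uint8_t val[2][8]; };   // plays the role of uint8x8x2_t

// Copy the bytes of `from` into a fresh To. Both types must have the same size.
template <typename To, typename From>
To bit_copy(const From& from) {
    static_assert(sizeof(To) == sizeof(From), "size mismatch");
    To to;
    std::memcpy(&to, &from, sizeof to);
    return to;
}

// Round trip: 16 bytes -> two groups of 8 bytes -> 16 bytes.
bool round_trip_ok() {
    FakeU8x16 a;
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = static_cast<std::uint8_t>(i);
    FakeU8x8x2 pair = bit_copy<FakeU8x8x2>(a);
    FakeU8x16 back = bit_copy<FakeU8x16>(pair);
    // Byte 8 of the input becomes the first byte of the second group.
    return pair.val[0][0] == 0 && pair.val[1][0] == 8 && back == a;
}
```

Because the copy goes through memcpy rather than through a pointer or union member of the other type, no object is ever read through the wrong type. With real NEON types the same pattern should apply, although that is not tested here.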

If you look at the edit history, I had implemented a custom version with combine operators for vectors of vectors (e.g. uint8x8x2_t). The problem was mentioned here. However, since those data types are declared as arrays (see guide, section 12.2.2) and therefore located in consecutive memory locations, the compiler is bound to treat the memcpy correctly.

Finally, to print the content of the variable one could use a function like this.

auselen

If you try to avoid casting in a sensible way through various data-structure hackery, you'll end up shuffling memory and words around, which will kill any performance you're hoping to get from NEON.

You can probably cast quad registers down to double registers easily, but the other way might not be possible.

Everything boils down to this: each instruction has only a few bits to index registers. If an instruction expects quad registers, it counts double registers two by two, as D(2*n) and D(2*n+1), and only n is encoded in the instruction; (2*n+1) is implicit for the core. If at any point in your code you try to cast two doubles into a quad, those may not be in consecutive registers, forcing the compiler to shuffle registers onto the stack and back to get a consecutive layout.
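The pairing rule can be written down as a tiny model. This is only a sketch of the 32-bit ARM (AArch32) register aliasing, where quad register n overlaps double registers 2n and 2n+1; DPair, d_regs_of_q, forms_q and pairing_examples are hypothetical names for this illustration, and AArch64 lays its registers out differently.

```cpp
#include <cassert>

// Indices of the two D registers that a Q register overlaps.
struct DPair { int lo; int hi; };

// AArch32 NEON: Q register q aliases D registers 2q and 2q+1.
constexpr DPair d_regs_of_q(int q) { return DPair{2 * q, 2 * q + 1}; }

// Two D registers can be viewed as one Q register without any move
// only if they are an even-aligned consecutive pair.
constexpr bool forms_q(int d_lo, int d_hi) {
    return d_lo % 2 == 0 && d_hi == d_lo + 1;
}

void pairing_examples() {
    assert(d_regs_of_q(3).lo == 6 && d_regs_of_q(3).hi == 7);  // Q3 = D6:D7
    assert(forms_q(4, 5));   // D4:D5 is Q2
    assert(!forms_q(5, 6));  // consecutive but not even-aligned
    assert(!forms_q(2, 6));  // not consecutive at all
}
```

When forms_q is false for the two halves the compiler happens to hold, it has to emit extra moves to build the quad value, which is the shuffling cost described above.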

I think it is still the same answer in different words https://stackoverflow.com/a/13734838/1163019

NEON instructions are designed for streaming: you load from memory in big chunks, process the data, then store what you want back. The mechanics should all be very simple. If they are not, you'll lose the extra performance NEON offers, which will make people ask why you're trying to use NEON in the first place and making life harder for yourself.
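A load, process, store loop of that kind might look like the sketch below. The function name add_u8 is hypothetical; the intrinsics vld1q_u8, vaddq_u8 and vst1q_u8 are from arm_neon.h, and the NEON path is only compiled when the compiler defines __ARM_NEON, with a scalar loop covering the tail and non-ARM builds.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Element-wise wrapping add of two byte arrays: out[i] = a[i] + b[i].
void add_u8(const std::uint8_t* a, const std::uint8_t* b,
            std::uint8_t* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Stream through memory 16 bytes at a time: load, process, store.
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vaddq_u8(va, vb));
    }
#endif
    // Scalar tail (and full fallback when NEON is not available).
    for (; i < n; ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    }
}
```

The vectors stay in a single type from load to store, so no reinterpretation between vector types is needed at all.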

Think of NEON as immutable value types and operations.
