Fastest way to test a 128 bit NEON register for a value of 0 using intrinsics?

天命终不由人 2021-01-12 13:58

I'm looking for the fastest way to test whether a 128-bit NEON register contains all zeros, using NEON intrinsics. I'm currently using 3 OR operations and 2 MOVs:

         


        
5 answers
  • 2021-01-12 14:41

    I'd avoid functions that return integer values which should only be interpreted as bool. A better way would be, for instance, to define a helper function that returns the maximum unsigned value of the 4 lanes:

    inline uint32_t max_lane_value_u32(const uint32x4_t& v)
    {
    #if defined(_WIN32) && defined(_ARM64_)
        // Windows 64-bit
        return neon_umaxvq32(v);
    #elif defined(__LP64__)
        // Linux/Android 64-bit
        return vmaxvq_u32(v);
    #else
        // Windows/Linux/Android 32-bit
        uint32x2_t result = vmax_u32(vget_low_u32(v), vget_high_u32(v));
        return vget_lane_u32(vpmax_u32(result, result), 0);
    #endif
    }
    

    you can then use:

    if (0 == max_lane_value_u32(v))
    {
        ...
    }
    

    in your code, and such a function might also be useful elsewhere. Alternatively, you can use the exact same code to write an is_not_zero() function, but then it is better form to return a bool.

    Note that the only reason you'd need to define a helper function is because vmaxvq_u32() is not available on 32-bit, and may not be aliased from neon_umaxvq32() in arm64_neon.h on Windows.
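
As a rough sketch of the bool-returning variant mentioned above: the name all_lanes_zero, the array parameter, and the scalar fallback are assumptions added for illustration, and the guard assumes a GCC/Clang-style compiler that defines __ARM_NEON. The NEON paths use the same reduction as max_lane_value_u32().

```cpp
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the answer above): true when all four
// 32-bit lanes loaded from `lanes` are zero.
inline bool all_lanes_zero(const std::uint32_t (&lanes)[4]) noexcept
{
#if defined(__ARM_NEON) && defined(__aarch64__)
    // AArch64: reduce the four lanes to their maximum in one step.
    return vmaxvq_u32(vld1q_u32(lanes)) == 0;
#elif defined(__ARM_NEON)
    // 32-bit ARM: fold the high half onto the low half, then pairwise max.
    const uint32x4_t v = vld1q_u32(lanes);
    const uint32x2_t m = vmax_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(m, m), 0) == 0;
#else
    // Scalar fallback (assumption) so the sketch also builds off ARM.
    std::uint32_t m = 0;
    for (std::uint32_t lane : lanes) {
        if (lane > m) m = lane;
    }
    return m == 0;
#endif
}
```

Returning bool here means a caller cannot mistake the result for a lane value; the comparison against 0 happens once, inside the helper.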

  • 2021-01-12 14:45

    If you're targeting AArch64 NEON, you can use the following to get a value to test with just two instructions:

    inline uint64_t is_not_zero(uint32x4_t v)
    {
        uint64x2_t v64 = vreinterpretq_u64_u32(v);
        uint32x2_t v32 = vqmovn_u64(v64);
        uint64x1_t result = vreinterpret_u64_u32(v32);
        return vget_lane_u64(result, 0);  // intrinsic instead of vector-extension indexing
    }
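
Why the saturating narrow matters here can be shown with a scalar model of a single 64-bit lane. This is only an illustration: saturating_narrow and truncating_narrow are hypothetical names, not intrinsics. They model what vqmovn_u64 and a plain truncating narrow do per lane.

```cpp
#include <cstdint>

// Scalar model of one lane of an unsigned saturating narrow (64 -> 32 bits).
// Hypothetical helper, for illustration only.
constexpr std::uint32_t saturating_narrow(std::uint64_t x)
{
    return x > 0xFFFFFFFFull ? 0xFFFFFFFFu : static_cast<std::uint32_t>(x);
}

// Plain truncation, shown for contrast: it keeps only the low 32 bits.
constexpr std::uint32_t truncating_narrow(std::uint64_t x)
{
    return static_cast<std::uint32_t>(x);
}

// A lane with only upper bits set stays non-zero under saturation...
static_assert(saturating_narrow(0x0000000100000000ull) == 0xFFFFFFFFu,
              "upper bits saturate to all ones");
// ...but would be lost by truncation.
static_assert(truncating_narrow(0x0000000100000000ull) == 0u,
              "truncation drops the upper bits");
// Zero stays zero, so the narrowed value is zero exactly when the lane was.
static_assert(saturating_narrow(0) == 0u, "zero stays zero");
```

Because each 64-bit lane narrows to zero only if it was zero, the two narrowed lanes packed into one 64-bit value are zero exactly when the whole 128-bit register was.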
    
  • 2021-01-12 14:45

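    You seem to be looking for intrinsics; this is one way, relying on the GCC/Clang vector extensions for the comparison, the casts, and the indexing:

    inline bool is_zero(int32x4_t v) noexcept
    {
      // Non-zero lanes become all ones; zero lanes stay 0.
      v = v != int32x4_t{};
    
      // Gather byte 0 of each lane into one 32-bit word and test it.
      return !int32x2_t(
        vtbl2_s8(
          int8x8x2_t{
            int8x8_t(vget_low_s32(v)),
            int8x8_t(vget_high_s32(v))
          },
          int8x8_t{0, 4, 8, 12}
        )
      )[0];
    }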
    

    Nils Pipenbrinck's answer has a flaw: it assumes that QC, the cumulative saturation flag, is clear beforehand.

  • 2021-01-12 14:52

    While this answer may be a bit late, there is a simple way to do the test with only 3 instructions and no extra registers:

    inline uint32_t is_not_zero(uint32x4_t v)
    {
        uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
        return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
    }
    

    The return value will be nonzero if any bit in the 128-bit NEON register was set.

  • 2021-01-12 14:56

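    If you have AArch64 it is even easier: there are across-lanes reduction instructions designed for this. Use the maximum rather than the sum (vaddvq_u32), because a 32-bit sum can wrap around to zero, for example with lanes 1 and 0xFFFFFFFF.

    inline uint32_t is_not_zero(uint32x4_t v)
    {
        return vmaxvq_u32(v);  // non-zero if any lane is non-zero
    }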
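
One caveat worth checking for any add-across-lanes reduction such as vaddvq_u32: the result is a 32-bit sum computed modulo 2^32, so non-zero lanes can add up to zero. The scalar model below contrasts it with a maximum reduction of the kind vmaxvq_u32 performs. The function names are hypothetical stand-ins, not intrinsics.

```cpp
#include <cstdint>

// Scalar model of an add-across-lanes reduction with a 32-bit result.
// Unsigned arithmetic wraps modulo 2^32, as the 32-bit sum does.
constexpr std::uint32_t add_across(std::uint32_t a, std::uint32_t b,
                                   std::uint32_t c, std::uint32_t d)
{
    return static_cast<std::uint32_t>(a + b + c + d);
}

constexpr std::uint32_t max2(std::uint32_t a, std::uint32_t b)
{
    return a > b ? a : b;
}

// Scalar model of a max-across-lanes reduction.
constexpr std::uint32_t max_across(std::uint32_t a, std::uint32_t b,
                                   std::uint32_t c, std::uint32_t d)
{
    return max2(max2(a, b), max2(c, d));
}

// Two non-zero lanes whose sum wraps to zero: the add reduction reports 0.
static_assert(add_across(1u, 0xFFFFFFFFu, 0u, 0u) == 0u,
              "sum wraps to zero");
// The max reduction still reports a non-zero value for the same lanes.
static_assert(max_across(1u, 0xFFFFFFFFu, 0u, 0u) == 0xFFFFFFFFu,
              "max stays non-zero");
```

If the lanes are known to hold small values (for example comparison masks reduced to 0 or 1), the sum cannot wrap and either reduction works; for arbitrary register contents the maximum is the safer test.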
    