I'm looking for the fastest way to test whether a 128-bit NEON register contains all zeros, using NEON intrinsics. I'm currently using 3 OR operations and 2 MOVs:
I'd avoid functions returning integer values that should only be interpreted as bool. A better way would be, for instance, to define a helper function that returns the maximum unsigned value across the 4 lanes:
inline uint32_t max_lane_value_u32(const uint32x4_t& v)
{
#if defined(_WIN32) && defined(_ARM64_)
// Windows 64-bit
return neon_umaxvq32(v);
#elif defined(__LP64__)
// Linux/Android 64-bit
return vmaxvq_u32(v);
#else
// Windows/Linux/Android 32-bit
uint32x2_t result = vmax_u32(vget_low_u32(v), vget_high_u32(v));
return vget_lane_u32(vpmax_u32(result, result), 0);
#endif
}
you can then use:
if (0 == max_lane_value_u32(v))
{
...
}
in your code, and such a function might also be useful elsewhere. Alternatively, you can use the exact same code to write an is_not_zero() function, but then it is better form to return a bool.
Note that the only reason you'd need to define a helper function is because vmaxvq_u32() is not available on 32-bit, and may not be aliased from neon_umaxvq32() in arm64_neon.h on Windows.
If you're targeting AArch64 NEON, you can use the following to get a value to test with just two instructions:
inline uint64_t is_not_zero(uint32x4_t v)
{
uint64x2_t v64 = vreinterpretq_u64_u32(v);
uint32x2_t v32 = vqmovn_u64(v64);
uint64x1_t result = vreinterpret_u64_u32(v32);
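// vget_lane_u64 is the portable intrinsic; result[0] only works with GCC/Clang vector extensions.
return vget_lane_u64(result, 0);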
}
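To see why the saturating narrow matters here, a scalar model of a single 64-bit lane may help. This is only a sketch: `saturating_narrow_u64` and `truncating_narrow_u64` are hypothetical names standing in for one lane of vqmovn_u64 and vmovn_u64 respectively, not real intrinsics.

```cpp
#include <cstdint>

// Scalar model of one lane of a saturating narrow (vqmovn_u64):
// values that do not fit in 32 bits clamp to 0xFFFFFFFF.
constexpr std::uint32_t saturating_narrow_u64(std::uint64_t x)
{
    return x > 0xFFFFFFFFull ? 0xFFFFFFFFu : static_cast<std::uint32_t>(x);
}

// Scalar model of one lane of a plain narrow (vmovn_u64):
// the upper 32 bits are simply discarded.
constexpr std::uint32_t truncating_narrow_u64(std::uint64_t x)
{
    return static_cast<std::uint32_t>(x);
}

// A bit set only in the upper half survives saturation...
static_assert(saturating_narrow_u64(0x0000000100000000ull) == 0xFFFFFFFFu, "");
// ...but would be lost by truncation.
static_assert(truncating_narrow_u64(0x0000000100000000ull) == 0u, "");
// Zero stays zero either way.
static_assert(saturating_narrow_u64(0) == 0u, "");
```

So each narrowed lane is nonzero exactly when the original 64-bit lane was nonzero, which is what makes the 64-bit result usable as a test value.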
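You seem to be looking for intrinsics, so here is one way. Note that it also relies on GCC/Clang vector extensions (vector comparison, brace initialization and subscripting):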
inline bool is_zero(int32x4_t v) noexcept
{
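// Each lane becomes all-ones where v is nonzero, so the gathered bytes below are 0 only if every lane is zero.
v = v != int32x4_t{};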
return !int32x2_t(
vtbl2_s8(
int8x8x2_t{
int8x8_t(vget_low_s32(v)),
int8x8_t(vget_high_s32(v))
},
int8x8_t{0, 4, 8, 12}
)
)[0];
}
Nils Pipenbrinck's answer has a flaw: it assumes that QC, the cumulative saturation flag, is clear.
While this answer may be a bit late, there is a simple way to do the test with only 3 instructions and no extra registers:
inline uint32_t is_not_zero(uint32x4_t v)
{
uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
}
The return value will be nonzero if any bit in the 128-bit NEON register was set.
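As a rough illustration of why that holds, here is a scalar model of the same sequence. `Lanes4`, `max_u32` and `is_not_zero_model` are hypothetical names used only for this sketch; they stand in for the uint32x4_t register and the vorr_u32 / vpmax_u32 steps.

```cpp
#include <cstdint>

// Hypothetical scalar stand-in for a uint32x4_t.
struct Lanes4 { std::uint32_t lane[4]; };

constexpr std::uint32_t max_u32(std::uint32_t a, std::uint32_t b)
{
    return a > b ? a : b;
}

// Mirrors: tmp = vorr_u32(low, high); vget_lane_u32(vpmax_u32(tmp, tmp), 0).
constexpr std::uint32_t is_not_zero_model(const Lanes4& v)
{
    // tmp[0] = lane0 | lane2 and tmp[1] = lane1 | lane3;
    // lane 0 of the pairwise max is max(tmp[0], tmp[1]).
    return max_u32(v.lane[0] | v.lane[2], v.lane[1] | v.lane[3]);
}

static_assert(is_not_zero_model(Lanes4{{0, 0, 0, 0}}) == 0u, "");
static_assert(is_not_zero_model(Lanes4{{0, 0, 0, 1}}) != 0u, "");
static_assert(is_not_zero_model(Lanes4{{0x80000000u, 0, 0, 0}}) != 0u, "");
```

Neither OR nor unsigned max can turn a nonzero input into zero, so the combined value is zero only when all four lanes are zero.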
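If you are on AArch64, it is even easier: there are across-lanes instructions designed for this kind of reduction. Use the maximum (UMAXV) rather than the add (vaddvq_u32), because a sum of nonzero lanes can wrap around to zero.
inline uint32_t is_not_zero(uint32x4_t v)
{
return vmaxvq_u32(v);
}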
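One caveat with across-lanes reductions: an unsigned add wraps modulo 2^32, so it can report 0 for a register that is not all zeros, while a maximum cannot. A small scalar sketch of the difference, where `add_across`, `max2` and `max_across` are hypothetical names modelling vaddvq_u32 and vmaxvq_u32 on four lanes:

```cpp
#include <cstdint>

// Scalar model of an across-lanes add: unsigned arithmetic wraps modulo 2^32.
constexpr std::uint32_t add_across(std::uint32_t a, std::uint32_t b,
                                   std::uint32_t c, std::uint32_t d)
{
    return static_cast<std::uint32_t>(a + b + c + d);
}

constexpr std::uint32_t max2(std::uint32_t a, std::uint32_t b)
{
    return a > b ? a : b;
}

// Scalar model of an across-lanes unsigned maximum.
constexpr std::uint32_t max_across(std::uint32_t a, std::uint32_t b,
                                   std::uint32_t c, std::uint32_t d)
{
    return max2(max2(a, b), max2(c, d));
}

// Two lanes of 0x80000000 sum to 2^32, which wraps to 0...
static_assert(add_across(0x80000000u, 0x80000000u, 0, 0) == 0u, "");
// ...while the maximum still reports a nonzero lane.
static_assert(max_across(0x80000000u, 0x80000000u, 0, 0) == 0x80000000u, "");
```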