Fastest way to test a 128 bit NEON register for a value of 0 using intrinsics?

Posted by ∥☆過路亽.° on 2019-12-01 01:59:05

Question


I'm looking for the fastest way to test whether a 128-bit NEON register contains all zeros, using NEON intrinsics. I'm currently using 3 OR operations and 2 MOVs:

uint32x4_t vr = vorrq_u32(vcmp0, vcmp1);

uint64x2_t v0 = vreinterpretq_u64_u32(vr);
uint64x1_t v0or = vorr_u64(vget_high_u64(v0), vget_low_u64(v0));

uint32x2_t v1 = vreinterpret_u32_u64 (v0or);
uint32_t r = vget_lane_u32(v1, 0) | vget_lane_u32(v1, 1);

if (r == 0) { /* do stuff */ }

GCC translates this to the following assembly code:

VORR     q9, q9, q10
VORR     d16, d18, d19
VMOV.32  r3, d16[0]
VMOV.32  r2, d16[1]
ORRS     r2, r2, r3
BEQ      ...

Does anyone have an idea of a faster way?


Answer 1:


While this answer may be a bit late, there is a simple way to do the test with only 3 instructions and no extra registers:

inline uint32_t is_not_zero(uint32x4_t v)
{
    uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
}

The return value will be nonzero if any bit in the 128-bit NEON register was set.
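As a rough usage sketch, the same reduction can be wrapped so that it also builds on targets without NEON. The helper name `any_lane_nonzero` and its four scalar parameters are hypothetical, chosen only for illustration; the fallback branch is a scalar model of the two NEON steps (OR the halves, then take the pairwise maximum), not part of the original answer.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: returns nonzero if any of the four lanes is nonzero. */
static inline uint32_t any_lane_nonzero(uint32_t a, uint32_t b,
                                        uint32_t c, uint32_t d)
{
    const uint32_t lanes[4] = {a, b, c, d};
    uint32x4_t v = vld1q_u32(lanes);

    /* OR the low and high 64-bit halves, then pairwise max of the two lanes. */
    uint32x2_t tmp = vorr_u32(vget_low_u32(v), vget_high_u32(v));
    return vget_lane_u32(vpmax_u32(tmp, tmp), 0);
}
#else
/* Scalar model of the same steps, for builds without NEON. */
static inline uint32_t any_lane_nonzero(uint32_t a, uint32_t b,
                                        uint32_t c, uint32_t d)
{
    uint32_t t0 = a | c;          /* lane 0 of low | high */
    uint32_t t1 = b | d;          /* lane 1 of low | high */
    return t0 > t1 ? t0 : t1;     /* pairwise max, lane 0 */
}
#endif
```

Because OR and unsigned max both preserve "nonzero", the result is zero only when all 128 bits are zero.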




Answer 2:


If you're targeting AArch64 NEON, you can use the following to get a value to test with just two instructions:

inline uint64_t is_not_zero(uint32x4_t v)
{
    uint64x2_t v64 = vreinterpretq_u64_u32(v);
    uint32x2_t v32 = vqmovn_u64(v64);
    uint64x1_t result = vreinterpret_u64_u32(v32);
    return vget_lane_u64(result, 0);
}
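The reason a saturating narrow works here, while a plain truncating narrow would not, can be seen in a scalar model. The sketch below is plain C with hypothetical function names and is not NEON code: it assumes `vqmovn_u64` clamps each 64-bit lane to `UINT32_MAX` and `vmovn_u64` keeps only the low 32 bits, and it assumes lane 0 ends up in the low half of the 64-bit result.

```c
#include <stdint.h>

/* Scalar model of one lane of vqmovn_u64: unsigned saturating narrow. */
static uint32_t sat_narrow_u64(uint64_t x)
{
    return x > UINT32_MAX ? UINT32_MAX : (uint32_t)x;
}

/* Scalar model of one lane of vmovn_u64: plain truncation. */
static uint32_t trunc_narrow_u64(uint64_t x)
{
    return (uint32_t)x;
}

/* Model of the whole test: narrow both lanes, then read the pair as one
 * 64-bit value (lane 0 in the low half). */
static uint64_t is_not_zero_model(uint64_t lane0, uint64_t lane1)
{
    return ((uint64_t)sat_narrow_u64(lane1) << 32) | sat_narrow_u64(lane0);
}
```

A lane that holds only high bits, such as `1ULL << 32`, truncates to 0 but saturates to `UINT32_MAX`, so the saturating version never turns a nonzero lane into zero.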



Answer 3:


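Since you asked for intrinsics, here is one way. It relies on the GCC/Clang vector extensions for the comparison, the casts, and the indexing:

inline bool is_zero(int32x4_t v) noexcept
{
  // Each lane becomes all ones if it was nonzero, and zero otherwise.
  v = v != int32x4_t{};

  // Gather one byte from each lane mask into the low 32 bits;
  // that value is zero only if every lane was zero.
  return !int32x2_t(
    vtbl2_s8(
      int8x8x2_t{
        int8x8_t(vget_low_s32(v)),
        int8x8_t(vget_high_s32(v))
      },
      int8x8_t{0, 4, 8, 12}
    )
  )[0];
}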

Nils Pipenbrinck's answer has a flaw: it assumes that QC, the cumulative saturation flag, is clear beforehand.




Answer 4:


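If you have AArch64, this is even easier: it has across-lane reduction instructions that suit this test.

inline uint32_t is_not_zero(uint32x4_t v)
{
    // Maximum across lanes. A sum across lanes (vaddvq_u32) can wrap
    // to zero for a nonzero vector, so it is not a reliable test.
    return vmaxvq_u32(v);
}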
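One caveat with across-lane reductions: a 32-bit sum across lanes wraps modulo 2^32, so it can come out as zero for a vector that is not zero, whereas an unsigned maximum across lanes (`vmaxvq_u32`) is zero only when every lane is zero. The scalar model below is an illustration in plain C, not NEON code, and the names `add_across` and `max_across` are hypothetical stand-ins for `vaddvq_u32` and `vmaxvq_u32`.

```c
#include <stdint.h>

/* Scalar model of vaddvq_u32: sum of four lanes, wrapping modulo 2^32. */
static uint32_t add_across(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (uint32_t)(a + b + c + d);
}

/* Scalar model of vmaxvq_u32: unsigned maximum of four lanes. */
static uint32_t max_across(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}
```

For the lanes `{0x80000000, 0x80000000, 0, 0}`, the sum wraps to 0 while the maximum stays 0x80000000, so only the maximum reports the vector as nonzero.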


Source: https://stackoverflow.com/questions/15389539/fastest-way-to-test-a-128-bit-neon-register-for-a-value-of-0-using-intrinsics
