How can I vectorize an IF block using ARM Neon intrinsics?

Question

I want to process a large array of floating-point numbers on the ARM processor, using Neon technology to calculate them four at a time. Everything's fine for operations like add and multiply, but what do I do if my calculation goes into an IF block? Example:

// In the non-vectorized original code, A is an array of many floating-point
// numbers, which are calculated one at a time.  Now they're packed 
// into a vector and processed four at a time

...calculate A...

if (A > 10.f)
{
    A = A+5.f;
}
else
{
    A = A+10.f;
}

Now, which IF branch do I execute? What if some of the values in the vector being processed are greater than 10 and some are less? Is it even possible to vectorize code like this?

Answer 1:

I'll add to the answers so far by describing how to code it in Neon intrinsics.

In general, you don't do IF-block logic based on parallel register contents, because one value may require one branch of the IF block and a different value in the same register may require the other. "Eager execution" means doing all the possible calculations first, and then deciding which results to actually use in which lanes. (Remember, you don't gain anything by doing a Neon calculation for only one lane of a register. Any computation that has to be done at all gets done for all 2 or 4 lanes.)

To do an IF-based computation, use a Neon comparison intrinsic, e.g. "greater than", to make a bitmask, and then a "select" function to populate the final result according to the bitmask:

double aval[2] = {11.5, 9.5};

// Note: float64x2_t and the *_f64 intrinsics are available on AArch64 only.
float64x2_t AA = vld1q_f64(aval);       // a vector with two 64-bit double values

float64x2_t TEN  = vmovq_n_f64(10.0);   // broadcast a constant into both lanes
float64x2_t FIVE = vmovq_n_f64(5.0);    // broadcast a constant into both lanes

// Do both of the computations (branches as in the question: A > 10 adds 5, otherwise adds 10)
float64x2_t VALIFTRUE  = vaddq_f64(AA, FIVE);  // A + 5:  {16.5, 14.5}
float64x2_t VALIFFALSE = vaddq_f64(AA, TEN);   // A + 10: {21.5, 19.5}

uint64x2_t IF1 = vcgtq_f64(AA, TEN);  // comparison "if (A > 10.)"

The return value of vcgtq_f64 is not a set of doubles but two 64-bit unsigned integers. They are a bitmask that can be used by "bitwise select" functions such as vbslq_f64. Here the first 64 bits of IF1 are all 1's (the greater-than condition was true for 11.5) and the second 64 bits are all 0's (it was false for 9.5).

AA = vbslq_f64(IF1, VALIFTRUE, VALIFFALSE);  // {16.5, 19.5}

...and each lane of AA is populated with either VALIFTRUE or VALIFFALSE for that lane, as appropriate.

What if eager execution is just too slow--the computations in one branch are very costly in processor time and you want to avoid doing them at all if you can? You'd have to verify that that branch condition isn't true for any of the vector lanes and then skip over the computations with a proper "if" statement. Perhaps someone else can comment on how well this works out in practice.
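
A rough sketch of that early-out idea, under some assumptions: it uses 32-bit floats, four per vector; `costly_vec` and `costly_scalar` are hypothetical stand-ins (here just x*x*x) for whatever expensive branch you want to avoid; and `any_lane_set` is a helper name made up for this example. The `#if` guard is only there so the file also builds on a non-ARM host, where the scalar loop does all the work.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

// Hypothetical stand-in for an expensive "if" branch: x*x*x per lane.
static inline float32x4_t costly_vec(float32x4_t x)
{
    return vmulq_f32(vmulq_f32(x, x), x);
}

// Nonzero if at least one 32-bit lane of the comparison mask is set.
static inline int any_lane_set(uint32x4_t mask)
{
    uint32x2_t folded = vorr_u32(vget_low_u32(mask), vget_high_u32(mask));
    folded = vpmax_u32(folded, folded);   // lane 0 = max of the two halves
    return vget_lane_u32(folded, 0) != 0;
}
#endif

// Scalar version of the same hypothetical expensive branch.
static inline float costly_scalar(float x)
{
    return x * x * x;
}

// Computes: if (a > 10) a = costly(a); else a = a + 10;
void process_skip_costly(float *a, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const float32x4_t ten = vdupq_n_f32(10.0f);
    for (; i + 4 <= n; i += 4) {
        float32x4_t va     = vld1q_f32(a + i);
        uint32x4_t  mask   = vcgtq_f32(va, ten);
        float32x4_t result = vaddq_f32(va, ten);   // cheap branch, always computed
        if (any_lane_set(mask)) {
            // At least one lane needs the costly branch, so compute it for
            // all four lanes and keep it only where the mask is set.
            result = vbslq_f32(mask, costly_vec(va), result);
        }
        vst1q_f32(a + i, result);
    }
#endif
    // Scalar loop: handles the tail, or everything on a non-Neon build.
    for (; i < n; ++i)
        a[i] = (a[i] > 10.0f) ? costly_scalar(a[i]) : a[i] + 10.0f;
}
```

Whether this wins depends on the data: moving the mask out of the Neon unit to branch on it has a cost of its own, so it tends to help only when the skipped work is expensive and vectors with no lane over the threshold are common.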

Answer 2:

If-else slaloms are a nightmare for virtually all CPUs, and especially for vector units such as NEON, which has no conditional branches of its own.

Hence we apply "eager execution" on problems like this.

1. A boolean mask gets created.
2. Both the if and the else blocks are computed.
3. The "right" result gets selected by the mask.

It shouldn't be a problem to convert the aarch32 code below to intrinsics.

//aarch32
    vadd.f32    vecElse, vecA, vecTen // vecTen contains 10.0f
    vcgt.f32    vecMask, vecA, vecTen
    vadd.f32    vecA, vecA, vecFive
    vbif        vecA, vecElse, vecMask

//aarch64
    fadd    vecElse.4s, vecA.4s, vecTen.4s
    fcmgt   vecMask.4s, vecA.4s, vecTen.4s
    fadd    vecA.4s, vecA.4s, vecFive.4s
    bif     vecA.16b, vecElse.16b, vecMask.16b
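
For reference, here is one way the sequence above might look in intrinsics, applied to the question's array four floats at a time. This is a sketch, not a drop-in: the function name `add_by_threshold_select` is invented for the example, `vbslq_f32` is used for the select step (the compiler may emit a different member of the bitwise-select family, such as `vbif`, for it), and the `#if` guard exists only so the file also compiles on a non-ARM host.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Computes: if (a > 10) a = a + 5; else a = a + 10;  for every element.
void add_by_threshold_select(float *a, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const float32x4_t ten  = vdupq_n_f32(10.0f);
    const float32x4_t five = vdupq_n_f32(5.0f);
    for (; i + 4 <= n; i += 4) {
        float32x4_t va     = vld1q_f32(a + i);
        float32x4_t v_else = vaddq_f32(va, ten);    // else branch: A + 10
        uint32x4_t  mask   = vcgtq_f32(va, ten);    // all 1's where A > 10
        float32x4_t v_if   = vaddq_f32(va, five);   // if branch: A + 5
        // Take v_if where the mask is set and v_else elsewhere.
        vst1q_f32(a + i, vbslq_f32(mask, v_if, v_else));
    }
#endif
    // Scalar loop: handles the tail, or everything on a non-Neon build.
    for (; i < n; ++i)
        a[i] = (a[i] > 10.0f) ? a[i] + 5.0f : a[i] + 10.0f;
}
```

The scalar loop at the end picks up the last `n % 4` elements, which a four-wide loop cannot cover by itself.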

Answer 3:

In general with SIMD branching logic you use a compare mask and then select alternate results accordingly. I'll give pseudo code for your example and you can convert this to intrinsics or asm as needed:

v5 = vector(5)              // set up some constant vectors
v10 = vector(10)
vMask = compare_gt(vA, v10) // generate mask for vector compare A > 10
vA = add(vA, v10)           // vA = vA + 10 (all elements, unconditionally)
vTemp = and(v5, vMask)      // generate temp vector of 5 and 0 values based on mask
vA = sub(vA, vTemp)         // subtract 5 from elements which were > 10
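
In intrinsics, the pseudocode above could look roughly like this. The bitwise AND has to be done on the integer view of the vector, hence the `vreinterpretq` casts. The function name `add_by_threshold_masked` is made up for the example, and the `#if` guard is only there so the sketch also builds on a non-ARM host.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Computes a = a + 10, then subtracts 5 from the elements that were > 10.
void add_by_threshold_masked(float *a, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const float32x4_t ten       = vdupq_n_f32(10.0f);
    const uint32x4_t  five_bits = vreinterpretq_u32_f32(vdupq_n_f32(5.0f));
    for (; i + 4 <= n; i += 4) {
        float32x4_t va   = vld1q_f32(a + i);
        uint32x4_t  mask = vcgtq_f32(va, ten);       // all 1's where A > 10
        // 5.0f where the mask is set, 0.0f (all-zero bits) elsewhere.
        float32x4_t temp = vreinterpretq_f32_u32(vandq_u32(five_bits, mask));
        va = vaddq_f32(va, ten);                     // unconditional + 10
        va = vsubq_f32(va, temp);                    // - 5 only where A was > 10
        vst1q_f32(a + i, va);
    }
#endif
    // Scalar loop with the same arithmetic: tail, or everything without Neon.
    for (; i < n; ++i) {
        float temp = (a[i] > 10.0f) ? 5.0f : 0.0f;
        a[i] = (a[i] + 10.0f) - temp;
    }
}
```

This works because the all-zero bit pattern is 0.0f, so ANDing the constant with the mask yields 5.0f in lanes where the compare was true and 0.0f elsewhere.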

Source: https://stackoverflow.com/questions/47320062/how-can-i-vectorize-an-if-block-using-arm-neon-intrinsics

Tags

arm

neon