SIMD vectorize atan2 using ARM NEON assembly

Question

I want to calculate the magnitude and the angle of 4 points using NEON SIMD instructions in ARM assembly. Most languages (C++ in my case) have a library function that calculates the angle (atan2), but only for a single pair of floating-point values (x and y). I would like to use SIMD instructions that operate on q registers to calculate atan2 for a vector of 4 values.

High accuracy is not required; speed is more important.

I already have a few assembly instructions that calculate the magnitude for 4 floating-point values, with acceptable accuracy for my application. q1 contains the 4 "x" values (x1, x2, x3, x4). q2 contains the 4 "y" values (y1, y2, y3, y4). q7 first accumulates the sums of squares (x1^2 + y1^2, x2^2 + y2^2, x3^2 + y3^2, x4^2 + y4^2) and ends up holding the approximate magnitudes, i.e. the square roots of those sums.

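vmul.f32    q7, q1, q1    @ q7 = x^2
vmla.f32    q7, q2, q2    @ q7 = x^2 + y^2
vrecpe.f32  q7, q7        @ q7 ~= 1 / (x^2 + y^2)   (estimate)
vrsqrte.f32 q7, q7        @ q7 ~= 1 / sqrt(q7) = sqrt(x^2 + y^2)   (estimate)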

What is the fastest way to calculate an approximate atan2 for two vectors using SIMD instructions?

Answer 1:

See math-neon for an existing scalar float implementation. Because it has few or no conditionals, it should translate well to a SIMD implementation.

ARM NEON has no instruction that calculates this directly, so you need an approximation, and there are techniques that do better than a Taylor series. In particular, the minimax approach gives a good candidate polynomial. "Minimax" means minimizing the maximum error over the range; a Chebyshev approximation usually comes very close to it.
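As a rough illustration of the fixed-polynomial idea, the sketch below evaluates an odd degree-9 polynomial for atan(t) on 4 lanes using Horner's scheme. It is written in plain C rather than assembly so it can be checked on any machine. The name atan_poly4 and the ATAN_C* constants are hypothetical, and the coefficients are an assumed, commonly quoted fit for -1 <= t <= 1 (error on the order of 1e-5); they are not taken from math-neon, so verify them with your own tooling.

```c
#include <stddef.h>

/* Assumed coefficients of an odd polynomial for atan(t), -1 <= t <= 1.
 * Illustrative only: regenerate or verify them for your own range
 * and precision target. */
static const float ATAN_C1 =  0.9998660f;
static const float ATAN_C3 = -0.3302995f;
static const float ATAN_C5 =  0.1801410f;
static const float ATAN_C7 = -0.0851330f;
static const float ATAN_C9 =  0.0208351f;

/* Hypothetical helper: evaluates the polynomial on 4 lanes.
 * Every lane runs the same multiply/add sequence with the same
 * constants and no branches, which is what lets the loop body be
 * replaced by NEON multiply and multiply-accumulate instructions
 * on q registers. */
static void atan_poly4(const float t[4], float out[4])
{
    for (size_t i = 0; i < 4; ++i) {
        float t2 = t[i] * t[i];
        float p = ATAN_C9;          /* Horner's scheme in t^2 */
        p = p * t2 + ATAN_C7;
        p = p * t2 + ATAN_C5;
        p = p * t2 + ATAN_C3;
        p = p * t2 + ATAN_C1;
        out[i] = p * t[i];          /* restore the odd factor t */
    }
}
```

Dropping the higher-order terms (and refitting the remaining coefficients) trades accuracy for fewer instructions, which matches the stated priority of speed over precision.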

DSP Guru covers several methods of function approximation, and many books on the subject are available online. You can search for optimal polynomials using MATLAB, Octave or some other toolkit. Typically, you first need to fix the input range and the precision you require. Once you have a good algorithm for a single value, extending it to SIMD of any sort should be straightforward.

The question "calculate atan2" references Apple's atan.c source. The coefficients in that code were most likely derived with the methods described above. The problem with that code is that it does not map well to SIMD: the atan() approximation is piecewise, so it needs different coefficients depending on the range. For SIMD, you need the same coefficients (multipliers, divisors, equation) throughout the range, because all four lanes execute the same instructions.

Abramowitz and Stegun's Handbook of Mathematical Functions has a chapter on circular functions, with section 4.4.28 giving a logarithmic formula. This seems to be similar to the eglibc implementation.

Source: https://stackoverflow.com/questions/18187492/simd-vectorize-atan2-using-arm-neon-assembly

Tags

assembly

arm

vectorization

neon

atan2