Removing slow int64 division from fixed point atan2() approximation

大兔子大兔子 submitted on 2020-01-24 10:10:48

Question


I made a function to compute a fixed-point approximation of atan2(y, x). The problem is that of the ~83 cycles it takes to run the whole function, 70 cycles (compiling with gcc 4.9.1 mingw-w64 -O3 on an AMD FX-6100) are taken up by a single 64-bit integer division! And sadly neither operand of that division is constant. Can I speed up the division itself? Is there any way I can remove it?

I think I need this division because, since I approximate atan2(y, x) with a 1D lookup table, I need to normalise the distance of the point represented by x, y to something like a unit circle or unit square (I chose a unit 'diamond', which is a unit square rotated by 45° and gives pretty even precision across the positive quadrant). So the division computes (|y|-|x|) / (|y|+|x|). Note that the divisor is 32-bit while the numerator is a 32-bit number shifted 29 bits to the left, so that the result of the division has 29 fractional bits. Also, using floating-point division is not an option, as this function must not use floating-point arithmetic.
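For reference, a plain floating-point equivalent of that normalisation (just to illustrate the formula; the real function can't use floating point) looks like this:

#include <math.h>

/* Floating-point reference of the normalisation described above.
   It maps a direction in the positive quadrant onto [-1, +1] along the
   unit diamond |x| + |y| = 1: -1 on the +x axis, +1 on the +y axis. */
double diamond_pos(double y, double x)
{
    double ya = fabs(y), xa = fabs(x);
    return (ya - xa) / (ya + xa);
}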

Any ideas? I can't think of anything to improve this (and I can't figure out why it takes 70 cycles just for a division). Here's the full function for reference:

int32_t fpatan2(int32_t y, int32_t x)       // does the equivalent of atan2(y, x)/2pi, y and x are integers, not fixed point
{
    #include "fpatan.h" // includes the atan LUT as generated by tablegen.exe, the entry bit precision (prec), LUT size power (lutsp) and how many max bits |b-a| takes (abdp)
    const uint32_t outfmt = 32; // final output format in s0.outfmt
    const uint32_t ofs=30-outfmt, ds=29, ish=ds-lutsp, ip=30-prec, tp=30+abdp-prec, tmask = (1<<ish)-1, tbd=(ish-tp);   // ds is the division shift, the shift for the index, bit precision of the interpolation, the mask, the precision for t and how to shift from p to t
    const uint32_t halfof = 1UL<<(outfmt-1);    // represents 0.5 in the output format, which since it is in turns means half a circle
    const uint32_t pds=ds-lutsp;    // division shift and post-division shift
    uint32_t lutind, p, t, d;
    int32_t a, b, xa, ya, xs, ys, div, r;

    xs = x >> 31;       // equivalent of fabs()
    xa = (x^xs) - xs;
    ys = y >> 31;
    ya = (y^ys) - ys;

    d = ya+xa;
    if (d==0)       // if both y and x are 0 then they add up to 0 and we must return 0
        return 0;

    // the following does 0.5 * (1. - (y-x) / (y+x))
    // (y+x) is u1.31, (y-x) is s0.31, div is in s1.29

    div = ((int64_t) (ya-xa)<<ds) / d;  // '/d' normalises distance to the unit diamond, the result of the division is always within +/-(1<<ds), i.e. +/-1.0 in s1.29
    p = ((1UL<<ds) - div) >> 1;     // before shift the format is s2.29. position in u1.29

    lutind = p >> ish;      // index for the LUT
    t = (p & tmask) >> tbd;     // interpolator between two LUT entries

    a = fpatan_lut[lutind];
    b = fpatan_lut[lutind+1];
    r = (((b-a) * (int32_t) t) >> abdp) + (a<<ip);  // linear interpolation of a and b by t in s0.32 format

    // Quadrants
    if (xs)             // if x was negative
        r = halfof - r;     // r = 0.5 - r

    r = (r^ys) - ys;        // if y was negative then r is negated

    return r;
}

Answer 1:


Unfortunately a 70-cycle latency is typical for a 64-bit integer division on x86 CPUs. Floating-point division typically has about half that latency or less. The increased cost comes from the fact that modern CPUs only have dividers in their floating-point execution units (they're very expensive in terms of silicon area), so the integers need to be converted to floating point and back again. So just substituting a floating-point division in place of the integer one isn't likely to help. You'll need to refactor your code to use floating point throughout to take advantage of the faster floating-point division.

If you're able to refactor your code you might also be able to benefit from the approximate floating-point reciprocal instruction RCPSS, if you don't need an exact answer. It has a latency of around 5 cycles.
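For instance (only a sketch of the idea, not part of your original code), the reciprocal could be obtained with the _mm_rcp_ss intrinsic and refined with one Newton-Raphson step:

#include <xmmintrin.h>

/* Sketch: approximate 1.0f/d with RCPSS (~12-bit estimate), then one
   Newton-Raphson iteration to bring it up to roughly 23 bits of precision. */
static inline float fast_recip(float d)
{
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(d)));
    return r * (2.0f - d * r);
}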




Answer 2:


Based on @Iwillnotexist Idonotexist's suggestion to use lzcnt, a reciprocal and a multiplication, I implemented a division function that runs in about 23.3 cycles with pretty great precision of 1 part in 19 million, using a 1.5 kB LUT. For example, one of the worst cases is 1428769848 / 1080138864, which gives 1.3227648959 instead of 1.3227649663.

I figured out an interesting technique while researching this. I was really struggling to think of something fast and precise enough; not even a quadratic approximation of 1/x over [0.5, 1.0) combined with an interpolated difference LUT would do. Then I had the idea of doing it the other way around: I made a lookup table that contains the quadratic coefficients that fit the curve over each short segment representing 1/128th of [0.5, 1.0), which gives a very small error. Using the 7 most significant bits of what represents x in the [0.5, 1.0) range as the LUT index, I directly get the coefficients that work best for the segment that x falls into.
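The coefficient table itself comes from my tablegen.exe, which isn't listed here; just to illustrate the idea, a generator along these lines could produce fpdiv.h (only a sketch, assuming a simple three-point quadratic fit per segment with prec = 27 and lutsp = 7 as in the code below):

#include <stdio.h>
#include <math.h>

int main(void)
{
    const int prec = 27, lutsp = 7, segs = 1 << lutsp;  // 128 segments over [0.5, 1.0)
    for (int i = 0; i < segs; i++)
    {
        double x0 = 0.5 + 0.5 *  i      / segs;     // segment start
        double x2 = 0.5 + 0.5 * (i + 1) / segs;     // segment end
        double x1 = 0.5 * (x0 + x2);                // segment middle
        double y0 = 1.0 / x0, y1 = 1.0 / x1, y2 = 1.0 / x2;

        // quadratic c2*x^2 + c1*x + c0 through the three points (Newton form, expanded)
        double d01 = (y0 - y1) / (x0 - x1), d12 = (y1 - y2) / (x1 - x2);
        double c2 = (d01 - d12) / (x0 - x2);
        double c1 = d01 - c2 * (x0 + x1);
        double c0 = y0 - c1 * x0 - c2 * x0 * x0;

        // quantise to the 27 fractional bits that fpdiv() expects
        printf("%lld, %lld, %lld,\n",
               llrint(c0 * (1 << prec)),
               llrint(c1 * (1 << prec)),
               llrint(c2 * (1 << prec)));
    }
    return 0;
}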

Here's the full code with the lookup tables ffo_lut.h and fpdiv.h:

#include "ffo_lut.h"

static INLINE int32_t log2_ffo32(uint32_t x)    // returns the number of bits up to the most significant set bit so that 2^return > x >= 2^(return-1)
{
    int32_t y;

    y = x>>21;  if (y)  return ffo_lut[y]+21;
    y = x>>10;  if (y)  return ffo_lut[y]+10;
    return ffo_lut[x];
}
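
/* Note: on GCC/Clang an equivalent without the table would be
   (x ? 32 - __builtin_clz(x) : 0); the LUT above avoids depending on a
   fast BSR/LZCNT being available. */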

// Usage note: for fixed point inputs make outfmt = desired format + format of x - format of y
// The caller must make sure not to divide by 0. Division by 0 causes a crash by negative index table lookup
static INLINE int64_t fpdiv(int32_t y, int32_t x, int32_t outfmt)   // ~23.3 cycles, max error (by division) 53.39e-9
{
    #include "fpdiv.h"  // includes the quadratic coefficients LUT (1.5 kB) as generated by tablegen.exe, the format (prec=27) and LUT size power (lutsp)
    const int32_t *c;
    int32_t xa, xs, p, sh;
    uint32_t expon, frx, lutind;
    const uint32_t ish = prec-lutsp-1, cfs = 31-prec, half = 1L<<(prec-1);  // the shift for the index, the shift for 31-bit xa, the value of 0.5
    int64_t out;
    int64_t c0, c1, c2;

    // turn x into xa (|x|) and sign of x (xs)
    xs = x >> 31;
    xa = (x^xs) - xs;

    // decompose |x| into frx * 2^expon
    expon = log2_ffo32(xa);
    frx = (xa << (31-expon)) >> cfs;    // the fractional part is now in 0.27 format

    // lookup the 3 quadratic coefficients for c2*x^2 + c1*x + c0 then compute the result
    lutind = (frx - half) >> ish;       // range becomes [0, 2^26 - 1], in other words 0.26, then >> (26-lutsp) so the index is lutsp bits
    lutind *= 3;                // 3 entries for each index
    c = &fpdiv_lut[lutind];         // c points to the correct c0, c1, c2
    c0 = c[0];    c1 = c[1];    c2 = c[2];
    p = (int64_t) frx * frx >> prec;    // x^2
    p = c2 * p >> prec;         // c2 * x^2
    p += c1 * frx >> prec;          // + c1 * x
    p += c0;                // + c0, p = (1.0 , 2.0] in 2.27 format

    // apply the necessary bit shifts and reapplies the original sign of x to make final result
    sh = expon + prec - outfmt;     // calculates the final needed shift
    out = (int64_t) y * p;          // format is s31 + 1.27 = s32.27
    if (sh >= 0)
        out >>= sh;
    else
        out <<= -sh;
    out = (out^xs) - xs;            // if x was negative then out is negated

    return out;
}

I think ~23.3 cycles is about as good as it's gonna get for what it does, but if you have any ideas to shave a few cycles off please let me know.

As for the fpatan2() question, the solution is to replace this line:

div = ((int64_t) (ya-xa)<<ds) / d;

with this line:

div = fpdiv(ya-xa, d, ds);
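
For completeness, a quick consistency check along these lines could be used to compare fpdiv() against the exact 64-bit division it replaces (just a sketch; it assumes fpdiv() and the generated ffo_lut.h / fpdiv.h headers from above are compiled in, and uses a tolerance of a few dozen LSBs, generous compared to the stated maximum error):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
    for (int64_t ya = 1; ya < (1LL << 30); ya += 999983)        // keeps ya+xa below 2^31
        for (int64_t xa = 0; xa < ya; xa += 1048573)            // so |ya-xa| < ya+xa
        {
            int64_t exact  = ((int64_t) (ya - xa) << 29) / (ya + xa);
            int64_t approx = fpdiv((int32_t) (ya - xa), (int32_t) (ya + xa), 29);
            if (llabs(exact - approx) > 64)                     // tolerance in 29-bit LSBs
                printf("%lld / %lld: exact=%lld approx=%lld\n",
                       (long long) (ya - xa), (long long) (ya + xa),
                       (long long) exact, (long long) approx);
        }
    return 0;
}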



Answer 3:


Your time-hog instruction:

div = ((int64_t) (ya-xa)<<ds) / d;

exposes at least two issues. The first is that it shadows the standard div() function; this is a minor point and may never matter in practice. The second is that, according to C language rules, both operands are first converted to their common type, int64_t, and the division for that type then expands into a CPU instruction which divides a 128-bit dividend by a 64-bit divisor(!). Extract from the assembly of a cut-down version of your function:

  21:   48 89 c2                mov    %rax,%rdx
  24:   48 c1 fa 3f             sar    $0x3f,%rdx ## this is sign bit extension
  28:   48 f7 fe                idiv   %rsi

Yep, this division requires about 70 cycles and can't be optimized (well, really it can, but e.g. a reciprocal-divisor approach requires a multiplication with a 192-bit product). But if you are sure this division can be done with a 64-bit dividend and a 32-bit divisor, and that it won't overflow (the quotient fits in 32 bits, which I agree it does, because |ya-xa| is always smaller in absolute value than ya+xa), it can be sped up with an explicit assembly request:

uint64_t tmp_num = ((int64_t) (ya-xa))<<ds;
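/* edx:eax holds the 64-bit dividend for the 32-bit idiv below; the quotient
   lands in eax.  Note that idiv also writes the remainder to edx, so strictly
   edx should appear as an output or clobber as well. */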
asm("idivl %[d]" :
    [a] "=a" (div1) :
    "[a]" (tmp_num), "d" (tmp_num >> 32), [d] "q" (d) :
    "cc");

This is quick and dirty and should be verified carefully, but I hope the idea is clear. The resulting assembly now looks like:

  18:   48 98                   cltq   
  1a:   48 c1 e0 1d             shl    $0x1d,%rax
  1e:   48 89 c2                mov    %rax,%rdx
  21:   48 c1 ea 20             shr    $0x20,%rdx
  27:   f7 f9                   idiv   %ecx

This looks like a huge improvement, because a 64/32 division requires up to 25 clock cycles on the Core family, according to the Intel optimization manual, instead of the ~70 you see for a 128/64 division.

More minor improvements can be made; e.g. the shifts can be done more economically in parallel:

uint32_t diff = ya - xa;
uint32_t lowpart = diff << 29;
uint32_t highpart = diff >> 3;
asm("idivl %[d]" :
    [a] "=a" (div1) :
    "[a]" (lowpart), "d" (highpart), [d] "q" (d) :
    "cc");

which results in:

  18:   89 d0                   mov    %edx,%eax
  1a:   c1 e0 1d                shl    $0x1d,%eax
  1d:   c1 ea 03                shr    $0x3,%edx
  22:   f7 f9                   idiv   %ecx

but this is a minor fix compared to the division-related one.

To conclude, I really doubt this routine is worth implementing in C. The language is quite uneconomical for this kind of integer arithmetic, forcing useless widening and losing the high parts of products. The whole routine is worth moving to assembler.




Answer 4:


Given an fpatan() implementation, you could simply implement fpatan2() in terms of that.

Assuming constants are defined for pi and pi/2:

int32_t fpatan2( int32_t y, int32_t x )
{
    fixed theta ;

    if( x == 0 )
    {
        theta = y > 0 ? fixed_half_pi : -fixed_half_pi ;
    }
    else
    {
        theta = fpatan( y / x ) ;
        if( x < 0 )
        {
            theta += ( y < 0 ) ? -fixed_pi : fixed_pi ;
        }
    }

    return theta ;
}

Note that fixed-point library implementations are easy to get very wrong. You might take a look at Optimizing Math-Intensive Applications with Fixed-Point Arithmetic. The use of C++ in the library under discussion makes the code much simpler; in most cases you can just replace the float or double keyword with fixed. It does not, however, have an atan2() implementation; the code above is adapted from my implementation for that library.



Source: https://stackoverflow.com/questions/25585066/removing-slow-int64-division-from-fixed-point-atan2-approximation
