Fixed point approximation of 2^x, with input range of s5.26

流过昼夜 提交于 2019-12-24 08:03:54

问题


How can I implement 2^x fixed-point arithmetic s5.26 and input values is in range [-31.9, 31.9] using the minimax polynomial approximation for exp2() How to generate the polynomial using Sollya Tool mentioned in the following link Power of 2 approximation in fixed point


回答1:


Since fixed-point arithmetic generally does not include an "infinity" encoding representing overflowed results, any implementation of exp2() for an s5.26 format will be limited to inputs in the interval (-32, 5), resulting in outputs in [0, 32).

The computation of transcendental functions typically consist of argument reduction, core approximation, final result construction. In the case of exp2(a), a reasonable argument reduction scheme is to split a into integer part i and fractional part f, such that a == i + f, with f in [-0.5, 0.5]. One then computes exp2(f), and scales the result by 2i, which corresponds to shifts in fixed-point arithmetic: exp2(a) = exp2(f) * exp2(i).

The common design choices for the computation of exp2(f) are interpolation in tabulated values of exp2(), or polynomial approximation. Since we need 31 result bits for the largest arguments, accurate interpolation would probably want to use quadratic interpolation to keep the table size reasonable. Since many modern processors (including ones used in embedded systems) provide a fast integer multiplier, I will focus here on approximation by polynomial. For this, we want a polynomial with minimax properties, that is, one that minimizes the maximum error compared to the reference.

Both commercial and free tools offer built-in capabilities to generate minimax approximations, e.g. Mathematica's MiniMaxApproximation command, Maple's minimax command, and Sollya's fpminimax command. One might also chose to build one's own infrastructure based on the Remez algorithm, which is the approach I have used. As opposed to floating-point arithmetic which typically uses to-nearest-or-even rounding, fixed-point arithmetic is usually restricted to truncation of intermediate results. This adds additional error during expression evaluation. As a consequence, it is usually a good idea to try a heuristic-based search for small adjustments to the coefficients of the generated approximation to partially balance those accumulating one-sided errors.

Because we need up to 31 bits in the result, and because coefficients in core approximations are typically less than unity in magnitude, we cannot use the native fixed-point precision, here s5.26, for polynomial evaluation. Instead, we want to scale up the operands in intermediate computation to fully use the available range of 32-bit integers, by dynamically adjusting the fixed-point format we are working in. For reasons of efficiency, it seems advisable to arrange the computation such that multiplications use re-normalization right shifts by 32 bits. This will often allow the elimination of explicit shifts on 32-bit processors.

Since intermediate computation uses signed data, right shifts of signed, negative operands will occur. We want those right shifts to map to arithmetic right shift instructions, something the C standard does not guarantee. But on most commonly used platforms, C compilers do what is desirable for us. Otherwise, it may be necessary to resort to intrinsics or inline assembly. I developed the code below with the Microsoft compiler on an x64 platform.

In the evaluation of the polynomial approximation for exp2(f) the original floating-point coefficients, the dynamic scaling, and the heuristic adjustments are all clearly visible. The code below does not quite achieve full accuracy for large arguments. The biggest absolute error is 1.10233e-7, for the argument of 0x12de9c5b = 4.71739332: fixed_exp2() returns 0x693ab6a3 while the accurate result would be 0x693ab69c. Presumably full accuracy could be achieved by increasing the degree of the polynomial core approximation by one.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

/* on 32-bit architectures, there is often an instruction/intrinsic for this */
int32_t mulhi (int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 32);
}

/* compute exp2(a) in s5.26 fixed-point arithmetic */
int32_t fixed_exp2 (int32_t a)
{
    int32_t i, f, r, s;
    /* split a = i + f, such that f in [-0.5, 0.5] */
    i = (a + 0x2000000) & ~0x3ffffff; // 0.5
    f = a - i;   
    s = ((5 << 26) - i) >> 26;
    f = f << 5; /* scale up for maximum accuracy in intermediate computation */
    /* approximate exp2(f)-1 for f in [-0.5, 0.5] */
    r =                (int32_t)(1.53303146e-4 * (1LL << 36) + 996);
    r = mulhi (r, f) + (int32_t)(1.33887795e-3 * (1LL << 35) +  99);
    r = mulhi (r, f) + (int32_t)(9.61833261e-3 * (1LL << 34) + 121);
    r = mulhi (r, f) + (int32_t)(5.55036329e-2 * (1LL << 33) +  51);
    r = mulhi (r, f) + (int32_t)(2.40226507e-1 * (1LL << 32) +   8);
    r = mulhi (r, f) + (int32_t)(6.93147182e-1 * (1LL << 31) +   5);
    r = mulhi (r, f);
    /* add 1, scale based on integral portion of argument, round the result */
    r = ((((uint32_t)r * 2) + (uint32_t)(1.0*(1LL << 31)) + ((1U << s) / 2) + 1) >> s);
    /* when argument < -26.5, result underflows to zero */
    if (a < -0x6a000000) r = 0;
    return r;
}

/* convert from s5.26 fixed point to double-precision floating point */
double fixed_to_float (int32_t a)
{
    return a / 67108864.0;
}

int main (void)
{
    double a, res, ref, err, maxerr = 0.0;
    int32_t x, start, end;

    start = -0x7fffffff; // -31.999999985
    end =    0x14000000; //   5.000000000
    printf ("testing fixed_exp2 with inputs in [%.9f, %.9f)\n",  
            fixed_to_float (start), fixed_to_float (end));

    for (x = start; x < end; x++) {
        a = fixed_to_float (x);
        ref = exp2 (a);
        res = fixed_to_float (fixed_exp2 (x));
        err = fabs (res - ref);
        if (err > maxerr) {
            maxerr = err;
        }
    }
    printf ("max. abs. err = %g\n", maxerr);

    return EXIT_SUCCESS;
}

A table-based alternative would trade-off table storage for a reduction in the amount of computation that is performed. Depending on the size of the L1 data cache, this may or may not increase performance. One possible approach is to tabulate 2f-1 for f in [0, 1). The split the function argument into an integer i and a fraction f, such that f in [0, 1). In order to keep the table reasonably small, use quadratic interpolation, with the coefficients of the polynomial computed on the fly from three consecutive table entries. The result is slightly adjusted by a heuristically determined offset to somewhat compensate for the truncating nature of fixed-point arithmetic.

The table is indexed by leading bits of the fraction f. Using seven bits for the index (resulting in a table of 128+2 entries), accuracy is slightly worse than with the previous minimax polynomial approximation. Maximum absolute error is 1.74935e-7. It occurs for an argument of 0x11580000 = 4.33593750, where fixed_exp2() returns 0x50c7d771, whereas the accurate result would be 0x50c7d765.

/* For i in [0,129]: (exp2 (i/128.0) - 1.0) * (1 << 31) */
static const uint32_t expTab [130] =
{
    0x00000000, 0x00b1ed50, 0x0164d1f4, 0x0218af43,
    0x02cd8699, 0x0383594f, 0x043a28c4, 0x04f1f656,
    0x05aac368, 0x0664915c, 0x071f6197, 0x07db3580,
    0x08980e81, 0x0955ee03, 0x0a14d575, 0x0ad4c645,
    0x0b95c1e4, 0x0c57c9c4, 0x0d1adf5b, 0x0ddf0420,
    0x0ea4398b, 0x0f6a8118, 0x1031dc43, 0x10fa4c8c,
    0x11c3d374, 0x128e727e, 0x135a2b2f, 0x1426ff10,
    0x14f4efa9, 0x15c3fe87, 0x16942d37, 0x17657d4a,
    0x1837f052, 0x190b87e2, 0x19e04593, 0x1ab62afd,
    0x1b8d39ba, 0x1c657368, 0x1d3ed9a7, 0x1e196e19,
    0x1ef53261, 0x1fd22825, 0x20b05110, 0x218faecb,
    0x22704303, 0x23520f69, 0x243515ae, 0x25195787,
    0x25fed6aa, 0x26e594d0, 0x27cd93b5, 0x28b6d516,
    0x29a15ab5, 0x2a8d2653, 0x2b7a39b6, 0x2c6896a5,
    0x2d583eea, 0x2e493453, 0x2f3b78ad, 0x302f0dcc,
    0x3123f582, 0x321a31a6, 0x3311c413, 0x340aaea2,
    0x3504f334, 0x360093a8, 0x36fd91e3, 0x37fbefcb,
    0x38fbaf47, 0x39fcd245, 0x3aff5ab2, 0x3c034a7f,
    0x3d08a39f, 0x3e0f680a, 0x3f1799b6, 0x40213aa2,
    0x412c4cca, 0x4238d231, 0x4346ccda, 0x44563ecc,
    0x45672a11, 0x467990b6, 0x478d74c9, 0x48a2d85d,
    0x49b9bd86, 0x4ad2265e, 0x4bec14ff, 0x4d078b86,
    0x4e248c15, 0x4f4318cf, 0x506333db, 0x5184df62,
    0x52a81d92, 0x53ccf09a, 0x54f35aac, 0x561b5dff,
    0x5744fccb, 0x5870394c, 0x599d15c2, 0x5acb946f,
    0x5bfbb798, 0x5d2d8185, 0x5e60f482, 0x5f9612df,
    0x60ccdeec, 0x62055b00, 0x633f8973, 0x647b6ca0,
    0x65b906e7, 0x66f85aab, 0x68396a50, 0x697c3840,
    0x6ac0c6e8, 0x6c0718b6, 0x6d4f301f, 0x6e990f98,
    0x6fe4b99c, 0x713230a8, 0x7281773c, 0x73d28fde,
    0x75257d15, 0x767a416c, 0x77d0df73, 0x792959bb,
    0x7a83b2db, 0x7bdfed6d, 0x7d3e0c0d, 0x7e9e115c,
    0x80000000, 0x8163daa0
};

int32_t fixed_exp2 (int32_t x)
{
    int32_t f1, f2, dx, a, b, approx, idx, i, f;

    /* extract integer portion; 2**i is realized as a shift at the end */
    i = (x >> 26);
    /* extract fraction f so we can compute 2^f, 0 <= f < 1 */
    f = x & 0x3ffffff;
    /* index table of exp2 values using 7 most significant bits of fraction */
    idx = (uint32_t)f >> (26 - 7);
    /* difference between argument and next smaller sampling point */
    dx = f - (idx << (26 - 7));
    /* fit parabola through closest 3 sampling points; find coefficients a,b */
    f1 = (expTab[idx+1] - expTab[idx]);
    f2 = (expTab[idx+2] - expTab[idx]);
    a = f2 - (f1 << 1);
    b = (f1 << 1) - a;
    /* find function value offset for argument x by computing ((a*dx+b)*dx) */
    approx = a;
    approx = (int32_t)((((int64_t)approx)*dx) >> (26 - 7)) + b;
    approx = (int32_t)((((int64_t)approx)*dx) >> (26 - 7 + 1));
    /* combine integer and fractional parts of result, round result */
    approx = (((expTab[idx] + (uint32_t)approx + (uint32_t)(1.0*(1LL << 31)) + 22U) >> (30 - 26 - i)) + 1) >> 1;
    /* flush underflow to 0 */
    if (i < -27) approx = 0;
    return approx;
}


来源:https://stackoverflow.com/questions/53736820/fixed-point-approximation-of-2x-with-input-range-of-s5-26

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!