How to improve fixed point square-root for small values

Posted by 笑着哭i on 2019-11-27 22:59:49

The original implementation clearly has some problems. I got frustrated trying to fix them all within the structure of the existing code, so I ended up taking a different approach. I could probably fix the original now, but I like my way better anyway.

I start by treating the input number as a plain 64-bit integer (Q64.0), which is the same as shifting by 28 and then shifting back by 14 afterwards (the square root halves the shift). However, if you just do that, the accuracy is limited to 1/2^14 = 6.1035e-5, because the last 14 bits of the result will always be 0. To remedy this, I then shift a and remainder by the appropriate amounts and run the loop a second time to keep filling in digits. The code could be made cleaner and more efficient, but I'll leave that to someone else. The accuracy shown below is pretty much as good as you can get with Q36.28. If you compare the fixed-point sqrt with the floating-point sqrt of the input after it has been truncated to fixed point (that is, converted to fixed point and back), the errors are around 2e-9. (I didn't do this in the code below, but it requires a one-line change.) This is right in line with the best resolution available in Q36.28, which is 1/2^28 = 3.7253e-9.

By the way, one big mistake in the original code is that the term where m = 0 is never considered so that bit can never be set. Anyway, here is the code. Enjoy!

#include <iostream>
#include <cmath>

#include <stdint.h>  // provides uint64_t; unsigned long is only 32 bits on some platforms

uint64_t sqrt(uint64_t in_val)
{
    const uint64_t fixed_resolution_shift = 28;
    const unsigned max_shift=62;
    uint64_t a_squared=1ULL<<max_shift;
    unsigned b_shift=(max_shift>>1) + 1;
    uint64_t a=1ULL<<(b_shift - 1);

    uint64_t x=in_val;

    while(b_shift && a_squared>x)
    {
        a>>=1;
        a_squared>>=2;
        --b_shift;
    }

    uint64_t remainder=x-a_squared;
    --b_shift;

    while(remainder && b_shift)
    {
        uint64_t b_squared=1ULL<<(2*(b_shift - 1));
        uint64_t two_a_b=(a<<b_shift);

        while(b_shift && remainder<(b_squared+two_a_b))
        {
            b_squared>>=2;
            two_a_b>>=1;
            --b_shift;
        }
        uint64_t const delta=b_squared+two_a_b;
        if((remainder)>=delta && b_shift)
        {
            a+=(1ULL<<(b_shift - 1));
            remainder-=delta;
            --b_shift;
        }
    }
    a <<= (fixed_resolution_shift/2);
    b_shift = (fixed_resolution_shift/2) + 1;
    remainder <<= (fixed_resolution_shift);

    while(remainder && b_shift)
    {
        uint64_t b_squared=1ULL<<(2*(b_shift - 1));
        uint64_t two_a_b=(a<<b_shift);

        while(b_shift && remainder<(b_squared+two_a_b))
        {
            b_squared>>=2;
            two_a_b>>=1;
            --b_shift;
        }
        uint64_t const delta=b_squared+two_a_b;
        if((remainder)>=delta && b_shift)
        {
            a+=(1ULL<<(b_shift - 1));
            remainder-=delta;
            --b_shift;
        }
    }

    return a;
}

double fixed2float(uint64_t x)
{
    return static_cast<double>(x) * pow(2.0, -28.0);
}

uint64_t float2fixed(double f)
{
    return static_cast<uint64_t>(f * pow(2, 28.0));
}

void finderror(double num)
{
    double root1 = fixed2float(sqrt(float2fixed(num)));
    double root2 = pow(num, 0.5);
    std::cout << "input: " << num << ", fixed sqrt: " << root1 << " " << ", float sqrt: " << root2 << ", finderror: " << root2 - root1 << std::endl;
}

int main()
{
    finderror(0);
    finderror(1e-5);
    finderror(2e-5);
    finderror(3e-5);
    finderror(4e-5);
    finderror(5e-5);
    finderror(pow(2.0,1));
    finderror(1ULL<<35);
}

The output of the program is:

input: 0, fixed sqrt: 0 , float sqrt: 0, finderror: 0
input: 1e-05, fixed sqrt: 0.00316207 , float sqrt: 0.00316228, finderror: 2.10277e-07
input: 2e-05, fixed sqrt: 0.00447184 , float sqrt: 0.00447214, finderror: 2.97481e-07
input: 3e-05, fixed sqrt: 0.0054772 , float sqrt: 0.00547723, finderror: 2.43815e-08
input: 4e-05, fixed sqrt: 0.00632443 , float sqrt: 0.00632456, finderror: 1.26255e-07
input: 5e-05, fixed sqrt: 0.00707086 , float sqrt: 0.00707107, finderror: 2.06055e-07
input: 2, fixed sqrt: 1.41421 , float sqrt: 1.41421, finderror: 1.85149e-09
input: 3.43597e+10, fixed sqrt: 185364 , float sqrt: 185364, finderror: 2.24099e-09

Given that sqrt(ab) = sqrt(a)sqrt(b), can't you just trap the case where your number is small, shift it up by a given number of bits, compute the root, and shift the result back down by half that number of bits?

I.e.

 sqrt(n) = sqrt(n * 2^k) / sqrt(2^k)
         = sqrt(n * 2^k) * 2^(-k/2)

E.g. Choose k = 28 for any n less than 2^8.

Alexey Frunze

I'm not sure how you're getting the numbers from fixed::sqrt() shown in the table.

Here's what I do:

#include <stdio.h>
#include <math.h>

#define __int64 long long // gcc doesn't know __int64
typedef __int64 fixed;

#define FRACT 28

#define DBL2FIX(x) ((fixed)((double)(x) * (1LL << FRACT)))
#define FIX2DBL(x) ((double)(x) / (1LL << FRACT))

// De-++-ified code from
// http://www.justsoftwaresolutions.co.uk/news/optimizing-applications-with-fixed-point-arithmetic.html
fixed sqrtfix0(fixed num)
{
    static unsigned const fixed_resolution_shift=FRACT;

    unsigned const max_shift=62;
    unsigned __int64 a_squared=1LL<<max_shift;
    unsigned b_shift=(max_shift+fixed_resolution_shift)/2;
    unsigned __int64 a=1LL<<b_shift;

    unsigned __int64 x=num;

    unsigned __int64 remainder;

    while(b_shift && a_squared>x)
    {
        a>>=1;
        a_squared>>=2;
        --b_shift;
    }

    remainder=x-a_squared;
    --b_shift;

    while(remainder && b_shift)
    {
        unsigned __int64 b_squared=1LL<<(2*b_shift-fixed_resolution_shift);
        int const two_a_b_shift=b_shift+1-fixed_resolution_shift;
        unsigned __int64 two_a_b=(two_a_b_shift>0)?(a<<two_a_b_shift):(a>>-two_a_b_shift);
        unsigned __int64 delta;

        while(b_shift && remainder<(b_squared+two_a_b))
        {
            b_squared>>=2;
            two_a_b>>=1;
            --b_shift;
        }
        delta=b_squared+two_a_b;
        if((2*remainder)>delta)
        {
            a+=(1LL<<b_shift);
            remainder-=delta;
            if(b_shift)
            {
                --b_shift;
            }
        }
    }
    return (fixed)a;
}

// Adapted code from
// http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Digit-by-digit_calculation
fixed sqrtfix1(fixed num)
{
    fixed res = 0;
    fixed bit = (fixed)1 << 62; // The second-to-top bit is set
    int s = 0;

    // Scale num up to get more significant digits

    while (num && num < bit)
    {
        num <<= 1;
        s++;
    }

    if (s & 1)
    {
        num >>= 1;
        s--;
    }

    s = 14 - (s >> 1);

    while (bit != 0)
    {
        if (num >= res + bit)
        {
            num -= res + bit;
            res = (res >> 1) + bit;
        }
        else
        {
            res >>= 1;
        }

        bit >>= 2;
    }

    if (s >= 0) res <<= s;
    else res >>= -s;

    return res;
}

int main(void)
{
    double testData[] =
    {
        0,
        1e-005,
        2e-005,
        3e-005,
        4e-005,
        5e-005,
        6e-005,
        7e-005,
        8e-005,
    };
    int i;

    for (i = 0; i < sizeof(testData) / sizeof(testData[0]); i++)
    {
        double x = testData[i];
        fixed xf = DBL2FIX(x);

        fixed sqf0 = sqrtfix0(xf);
        fixed sqf1 = sqrtfix1(xf);

        double sq0 = FIX2DBL(sqf0);
        double sq1 = FIX2DBL(sqf1);

        printf("%10.8f:  "
               "sqrtfix0()=%10.8f / err=%e  "
               "sqrt()=%10.8f  "
               "sqrtfix1()=%10.8f / err=%e\n",
               x,
               sq0, fabs(sq0 - sqrt(x)),
               sqrt(x),
               sq1, fabs(sq1 - sqrt(x)));
    }

    printf("sizeof(double)=%d\n", (int)sizeof(double));

    return 0;
}

And here's what I get (with gcc and Open Watcom):

0.00000000:  sqrtfix0()=0.00003052 / err=3.051758e-05  sqrt()=0.00000000  sqrtfix1()=0.00000000 / err=0.000000e+00
0.00001000:  sqrtfix0()=0.00311279 / err=4.948469e-05  sqrt()=0.00316228  sqrtfix1()=0.00316207 / err=2.102766e-07
0.00002000:  sqrtfix0()=0.00445557 / err=1.656955e-05  sqrt()=0.00447214  sqrtfix1()=0.00447184 / err=2.974807e-07
0.00003000:  sqrtfix0()=0.00543213 / err=4.509667e-05  sqrt()=0.00547723  sqrtfix1()=0.00547720 / err=2.438148e-08
0.00004000:  sqrtfix0()=0.00628662 / err=3.793423e-05  sqrt()=0.00632456  sqrtfix1()=0.00632443 / err=1.262553e-07
0.00005000:  sqrtfix0()=0.00701904 / err=5.202484e-05  sqrt()=0.00707107  sqrtfix1()=0.00707086 / err=2.060551e-07
0.00006000:  sqrtfix0()=0.00772095 / err=2.501943e-05  sqrt()=0.00774597  sqrtfix1()=0.00774593 / err=3.390476e-08
0.00007000:  sqrtfix0()=0.00836182 / err=4.783859e-06  sqrt()=0.00836660  sqrtfix1()=0.00836649 / err=1.086198e-07
0.00008000:  sqrtfix0()=0.00894165 / err=2.621519e-06  sqrt()=0.00894427  sqrtfix1()=0.00894409 / err=1.777289e-07
sizeof(double)=8

EDIT:

I've missed the fact that the above sqrtfix1() won't work well with large arguments. It can be fixed by appending 28 zeroes to the argument and essentially calculating the exact integer square root of that. This comes at the expense of doing internal calculations in 128-bit arithmetic, but it's pretty straightforward:

fixed sqrtfix2(fixed num)
{
    unsigned __int64 numl, numh;
    unsigned __int64 resl = 0, resh = 0;
    unsigned __int64 bitl = 0, bith = (unsigned __int64)1 << 26;

    numl = num << 28;
    numh = num >> (64 - 28);

    while (bitl | bith)
    {
        unsigned __int64 tmpl = resl + bitl;
        unsigned __int64 tmph = resh + bith + (tmpl < resl);

        tmph = numh - tmph - (numl < tmpl);
        tmpl = numl - tmpl;

        if (tmph & 0x8000000000000000ULL)
        {
            resl >>= 1;
            if (resh & 1) resl |= 0x8000000000000000ULL;
            resh >>= 1;
        }
        else
        {
            numl = tmpl;
            numh = tmph;

            resl >>= 1;
            if (resh & 1) resl |= 0x8000000000000000ULL;
            resh >>= 1;

            resh += bith + (resl + bitl < resl);
            resl += bitl;
        }

        bitl >>= 2;
        if (bith & 1) bitl |= 0x4000000000000000ULL;
        if (bith & 2) bitl |= 0x8000000000000000ULL;
        bith >>= 2;
    }

    return resl;
}

And it gives pretty much the same results as the first answer above (slightly better for 3.43597e+10):

0.00000000:  sqrtfix0()=0.00003052 / err=3.051758e-05  sqrt()=0.00000000  sqrtfix2()=0.00000000 / err=0.000000e+00
0.00001000:  sqrtfix0()=0.00311279 / err=4.948469e-05  sqrt()=0.00316228  sqrtfix2()=0.00316207 / err=2.102766e-07
0.00002000:  sqrtfix0()=0.00445557 / err=1.656955e-05  sqrt()=0.00447214  sqrtfix2()=0.00447184 / err=2.974807e-07
0.00003000:  sqrtfix0()=0.00543213 / err=4.509667e-05  sqrt()=0.00547723  sqrtfix2()=0.00547720 / err=2.438148e-08
0.00004000:  sqrtfix0()=0.00628662 / err=3.793423e-05  sqrt()=0.00632456  sqrtfix2()=0.00632443 / err=1.262553e-07
0.00005000:  sqrtfix0()=0.00701904 / err=5.202484e-05  sqrt()=0.00707107  sqrtfix2()=0.00707086 / err=2.060551e-07
0.00006000:  sqrtfix0()=0.00772095 / err=2.501943e-05  sqrt()=0.00774597  sqrtfix2()=0.00774593 / err=3.390476e-08
0.00007000:  sqrtfix0()=0.00836182 / err=4.783859e-06  sqrt()=0.00836660  sqrtfix2()=0.00836649 / err=1.086198e-07
0.00008000:  sqrtfix0()=0.00894165 / err=2.621519e-06  sqrt()=0.00894427  sqrtfix2()=0.00894409 / err=1.777289e-07
2.00000000:  sqrtfix0()=1.41419983 / err=1.373327e-05  sqrt()=1.41421356  sqrtfix2()=1.41421356 / err=1.851493e-09
34359700000.00000000:  sqrtfix0()=185363.69654846 / err=5.097361e-06  sqrt()=185363.69655356  sqrtfix2()=185363.69655356 / err=1.164153e-09

Many many years ago I worked on a demo program for a small computer our outfit had built. The computer had a built-in square-root instruction, and we built a simple program to demonstrate the computer doing 16-bit add/subtract/multiply/divide/square-root on a TTY. Alas, it turned out that there was a serious bug in the square root instruction, but we had promised to demo the function. So we created an array of the squares of the values 1-255, then used a simple lookup to match the value typed in to one of the array values. The index was the square root.
