Fast 1/X division (reciprocal)

后端未结

关注

 6  1278

忘掉有多难

Is there some way to improve reciprocal (division 1 over X) with respect to speed, if the precision is not crucial?

So, I need to calculate 1/X. Is there so

相关标签:

6条回答

盖世英雄少女心

2020-12-29 23:11

First of all, if you turn on compiler optimizations, the compiler is likely to optimize the calculation if possible (to pull it out of a loop, for example). To see this optimization, you need to build and run in Release mode.

Division may be heavier than multiplication (but a commenter pointed out that reciprocals are just as fast as multiplication on modern CPUs, in which case, this isn't correct for your case), so if you do have 1/X appearing somewhere inside a loop (and more than once), you can assist by caching the result inside the loop (float Y = 1.0f/X;) and then using Y. (The compiler optimization might do this in any case.)

Also, certain formulas can be redesigned to remove division or other inefficient computations. For that, you could post the larger computation being performed. Even there, the program or algorithm itself can sometimes be restructured to prevent the need for hitting time-consuming loops from being hit as frequently.

How much accuracy can be sacrificed? If on the off chance you only need an order of magnitude, you can get that easily using the modulus operator or bitwise operations.

However, in general, there's no way to speed up division. If there were, compilers would already be doing it.

0 讨论(0)
发布评论:

提交评论
- 加载中...
攒了一身酷

2020-12-29 23:11

The fastest way that I know of is to use SIMD operations. http://msdn.microsoft.com/en-us/library/796k1tty(v=vs.90).aspx

0 讨论(0)
发布评论:

提交评论
- 加载中...

青春惊慌失措

2020-12-29 23:19

This should do it with a number of pre-unrolled newton iterations's evaluated as a Horner polynomial which uses fused-multiply accumulate operations most modern day CPU's execute in a single Clk cycle (every time):

float inv_fast(float x) {
    union { float f; int i; } v;
    float w, sx;
    int m;

    sx = (x < 0) ? -1:1;
    x = sx * x;

    v.i = (int)(0x7EF127EA - *(uint32_t *)&x);
    w = x * v.f;

    // Efficient Iterative Approximation Improvement in horner polynomial form.
    v.f = v.f * (2 - w);     // Single iteration, Err = -3.36e-3 * 2^(-flr(log2(x)))
    // v.f = v.f * ( 4 + w * (-6 + w * (4 - w)));  // Second iteration, Err = -1.13e-5 * 2^(-flr(log2(x)))
    // v.f = v.f * (8 + w * (-28 + w * (56 + w * (-70 + w *(56 + w * (-28 + w * (8 - w)))))));  // Third Iteration, Err = +-6.8e-8 *  2^(-flr(log2(x)))

    return v.f * sx;
}

Fine Print: Closer to 0, the approximation does not do so well so either you the programmer needs to test the performance or restrict the input from getting to low before resorting to hardware division. i.e. be responsible!

0 讨论(0)

梦谈多话

2020-12-29 23:28
I’ve bench tested these methods on Arduino NANO for speed and 'accuracy'.
Basic calculation was to setup variables, Y = 15,000,000 and Z = 65,535
(in my real case, Y is a constant and Z can vary between 65353 and 3000 so a useful test)
Calc time on Arduino was established by putting pin low, then high as calc made and then low again and comparing on times with logic analyser. FOR 100 CYCLES. With variables as unsigned integers:-
```
Y * Z takes 0.231 msec
Y / Z takes  3.867 msec.  
With variables as floats:-  
Y * Z takes  1.066 msec
Y / Z takes  4.113 msec.  
Basic Bench Mark  and ( 15,000,000/65535 = 228.885 via calculator.) 
```
Using {Jack Giffin’s} float reciprocal algorithm:
```
Y * reciprocal(Z)  takes  1.937msec  which is a good improvement, but accuracy less so 213.68.  
```
Using {nimig18’s} float inv_fast:
```
Y* inv_fast(Z)  takes  5.501 msec  accuracy 228.116  with single iteration  
Y* inv_fast(Z)  takes  7.895 msec  accuracy 228.883  with second iteration 
```
Using Wikipedia's Q_rsqrt (pointed to by {Jack Giffin})
```
Y * Q*rsqrt(Z) takes  6.104 msec  accuracy   228.116  with single iteration  
All entertaining but ultimately disappointing!
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
情深已故

2020-12-29 23:29

0 讨论(0)
发布评论:

提交评论
- 加载中...
野趣味

2020-12-29 23:35

First, make sure this isn't a case of premature optimization. Do you know that this is your bottleneck?

As Mystical says, 1/x can be calculated very quickly. Make sure you're not using double datatype for either the 1 or the divisor. Floats are much faster.

That said, benchmark, benchmark, benchmark. Don't waste your time spending hours on numerical theory just to discover the source of the poor performance is IO access.

0 讨论(0)
发布评论:

提交评论
- 加载中...