Why does GCC or Clang not optimise reciprocal to 1 instruction when using fast-math

醉话见心 2020-12-06 09:56

Does anyone know why GCC/Clang will not optimise the function test1 in the code sample below to simply use just the RCPPS instruction when the fast-math option is enabled?
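The code sample referenced here did not survive in this copy. As a stand-in, here is a minimal reconstruction (my assumption, not the original) of what test1 presumably looks like, written with the GCC/Clang vector extensions:

```c
#include <assert.h>

/* Hypothetical reconstruction of test1 (assumption -- the original
   sample is missing): a packed single-precision reciprocal, which
   -ffast-math allows the compiler to lower to rcpps. */
typedef float v4sf __attribute__((vector_size(16)));

v4sf test1(v4sf a)
{
    const v4sf one = {1.0f, 1.0f, 1.0f, 1.0f};
    return one / a;
}
```

Compiled with `-O2 -ffast-math`, this kind of function produces the rcpps-based sequence discussed in the answer.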

2 Answers
  •  -上瘾入骨i
    2020-12-06 10:34

    I was experimenting with a floating point math-heavy hot path in one of my applications, and found something similar. I don't usually look at the instructions emitted by my compiler, so I was a bit surprised and dug into the mathematical details.

    Here's the set of instructions generated by gcc, annotated by me with the carried computation:

    test1(float __vector): ; xmm0               = a
        rcpps   xmm1, xmm0 ; xmm1 = 1 / xmm0    = 1/a
        mulps   xmm0, xmm1 ; xmm0 = xmm0 * xmm1 = a * 1/a
        mulps   xmm0, xmm1 ; xmm0 = xmm0 * xmm1 = a * (1/a)^2
        addps   xmm1, xmm1 ; xmm1 = xmm1 + xmm1 = 2 * (1/a)
        subps   xmm1, xmm0 ; xmm1 = xmm1 - xmm0 = 2 * (1/a) - a * (1/a)^2
        movaps  xmm0, xmm1 ; xmm0 = xmm1        = 2 * (1/a) - a * (1/a)^2
        ret
    

    So what's going on here? Why spend four additional instructions computing an expression that is mathematically equivalent to the plain reciprocal?

    Well, the rcpps instruction only computes an approximate reciprocal. The other arithmetic instructions (mulps, addps, subps) are exact up to single precision. Let's write r(x) for the approximate reciprocal function. The final result then becomes y = 2*r(a) - a*r(a)^2. If we substitute r(x) = (1 + eps) * (1/x), with eps being the relative error, we get:

    y = 2 * (1 + eps) * (1/a) - a * (1 + eps)^2 * (1/a)^2
      = (2 + 2*eps - (1 + eps)^2) * (1/a)
      = (2 + 2*eps - (1 + 2*eps + eps^2)) * (1/a)
      = (1 - eps^2) * (1/a)
    

    The relative error of rcpps is at most 1.5 * 2^-12, so |eps| <= 1.5 * 2^-12, and therefore:

    eps^2 <= 2.25 * 2^-24
          <  1.5  * 2^-23
    

    So by executing these extra instructions we go from roughly 12 bits of precision to roughly 23 bits of precision. Note that a single-precision float has 24 bits of precision, so we get almost full precision here.
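    The precision gain is easy to check empirically. Below is a sketch using the standard SSE intrinsics (`_mm_rcp_ps` maps to rcpps, and so on) that applies the same instruction sequence as the listing above; `refined_rcp` and `max_rel_err` are helper names of my choosing:

```c
#include <immintrin.h>
#include <math.h>
#include <assert.h>

/* One Newton-Raphson refinement of rcpps, mirroring the compiler's
   sequence: y = 2*(1/a) - a*(1/a)^2. */
static __m128 refined_rcp(__m128 a)
{
    __m128 r = _mm_rcp_ps(a);                    /* rcpps: ~12-bit 1/a */
    __m128 t = _mm_mul_ps(_mm_mul_ps(a, r), r);  /* a * (1/a)^2        */
    return _mm_sub_ps(_mm_add_ps(r, r), t);      /* 2*(1/a) - t        */
}

/* Largest relative error |x*a - 1| over the four lanes. */
static float max_rel_err(__m128 x, __m128 a)
{
    float xv[4], av[4], worst = 0.0f;
    _mm_storeu_ps(xv, x);
    _mm_storeu_ps(av, a);
    for (int i = 0; i < 4; i++) {
        float e = fabsf(xv[i] * av[i] - 1.0f);
        if (e > worst)
            worst = e;
    }
    return worst;
}
```

    On any SSE-capable machine the refined result lands well inside the 23-bit error bound derived above, while the raw rcpps result sits near its documented 1.5 * 2^-12 bound.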

    So is this just some magical sequence of instructions that happens to get us extra precision? Not quite. It's based on Newton's method (which I gather is referred to as Newton-Raphson by folks who work with assembly a lot).

    Newton's method is a root-finding method. Given some function f(x) it finds approximate solutions to f(x) = 0, by starting with an approximate solution x_0 and iteratively improving upon it. The Newton iteration is given by:

    x_n+1 = x_n - f(x_n) / f'(x_n)
    
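    To make the iteration concrete, here is a small throwaway example (mine, not from the original answer) applying it to f(x) = x^2 - c, whose positive root is sqrt(c):

```c
#include <math.h>
#include <assert.h>

/* Newton's method for f(x) = x*x - c, with f'(x) = 2*x:
   x_{n+1} = x_n - (x_n^2 - c) / (2*x_n). Converges quadratically
   to sqrt(c) from a reasonable starting point. */
static double newton_sqrt(double c, double x0, int steps)
{
    double x = x0;
    for (int i = 0; i < steps; i++)
        x -= (x * x - c) / (2.0 * x);
    return x;
}
```

    Each step roughly doubles the number of correct digits; this quadratic convergence is exactly what turns rcpps's 12 good bits into 23.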

    In our case, we can reformulate finding the reciprocal 1/a of a as finding the root of the function f(x) = a*x - 1, with derivative f'(x) = a. Substituting that into the equation for the Newton iteration we get:

    x_n+1 = x_n - (a*x_n - 1) / a
    

    Two observations:

    1. In this case the Newton iteration actually gives us an exact result, rather than just a better approximation. This makes sense, because Newton's method works by making a linear approximation of f around x_n. In this case f is completely linear, so the approximation is perfect. However...

    2. Computing the Newton iteration requires us to divide by a, which is the exact computation we're trying to approximate. This creates a circular problem. We break the cycle by modifying the Newton iteration to use our approximate reciprocal x_n for the division by a:

      x_n+1 =  x_n - (a*x_n - 1) * x_n
            ~= x_n - (a*x_n - 1) / a
      

    This iteration would work just fine, but it's not great from a vector math perspective: it requires subtracting 1. To do so with vector math requires preparing a vector register with a sequence of 1s. This requires an additional instruction and an additional register.

    We can rewrite the iteration to avoid this:

    x_n+1 = x_n - (a*x_n - 1) * x_n
          = x_n - (a*x_n^2 - x_n)
          = 2*x_n - a*x_n^2
    

    Now substitute x_0 = r(a) and we recover our expression from above:

    y = x_1 = 2*r(a) - a*r(a)^2
    
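    As a final sanity check on the error analysis, we can feed this formula an artificially perturbed reciprocal and watch the relative error get squared. The check below runs in double precision so that float rounding does not muddy the picture; `refine` is my name for the step:

```c
#include <math.h>
#include <assert.h>

/* The refinement step y = 2*r - a*r^2, evaluated in double precision. */
static double refine(double a, double r)
{
    return 2.0 * r - a * r * r;
}
```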
