Optimizations for pow() with const non-integer exponent?

前端 未结 10 664
旧时难觅i
旧时难觅i 2020-12-04 09:17

I have hot spots in my code where I\'m doing pow() taking up around 10-20% of my execution time.

My input to pow(x,y) is very specific, so

10条回答
  •  余生分开走
    2020-12-04 09:37

    So traditionally the powf(x, p) = x^p is solved by rewriting x as x=2^(log2(x)) making powf(x,p) = 2^(p*log2(x)), which transforms the problem into two approximations exp2() & log2(). This has the advantage of working with larger powers p, however the downside is that this is not the optimal solution for a constant power p and over a specified input bound 0 ≤ x ≤ 1.

    When the power p > 1, the answer is a trivial minimax polynomial over the bound 0 ≤ x ≤ 1, which is the case for p = 12/5 = 2.4 as can be seen below:

    float pow12_5(float x){
        float mp;
        // Minimax horner polynomials for x^(5/12), Note: choose the accurarcy required then implement with fma() [Fused Multiply Accumulates]
        // mp = 0x4.a84a38p-12 + x * (-0xd.e5648p-8 + x * (0xa.d82fep-4 + x * 0x6.062668p-4)); // 1.13705697e-3
        mp = 0x1.117542p-12 + x * (-0x5.91e6ap-8 + x * (0x8.0f50ep-4 + x * (0xa.aa231p-4 + x * (-0x2.62787p-4))));  // 2.6079002e-4
        // mp = 0x5.a522ap-16 + x * (-0x2.d997fcp-8 + x * (0x6.8f6d1p-4 + x * (0xf.21285p-4 + x * (-0x7.b5b248p-4 + x * 0x2.32b668p-4))));  // 8.61377e-5
        // mp = 0x2.4f5538p-16 + x * (-0x1.abcdecp-8 + x * (0x5.97464p-4 + x * (0x1.399edap0 + x * (-0x1.0d363ap0 + x * (0xa.a54a3p-4 + x * (-0x2.e8a77cp-4))))));  // 3.524655e-5
        return(mp);
    }
    

    However when p < 1 the minimax approximation over the bound 0 ≤ x ≤ 1 does not appropriately converge to the desired accuracy. One option [not really] is to rewrite the problem y=x^p=x^(p+m)/x^m where m=1,2,3 is a positive integer, making the new power approximation p > 1 but this introduces division which is inherently slower.

    There's however another option which is to decompose the input x as its floating point exponent and mantissa form:

    x = mx* 2^(ex) where 1 ≤ mx < 2
    y = x^(5/12) = mx^(5/12) * 2^((5/12)*ex), let ey = floor(5*ex/12), k = (5*ex) % 12
      = mx^(5/12) * 2^(k/12) * 2^(ey)
    

    The minimax approximation of mx^(5/12) over 1 ≤ mx < 2 now converges much faster than before, without division, but requires 12 point LUT for the 2^(k/12). The code is below:

    float powk_12LUT[] = {0x1.0p0, 0x1.0f38fap0, 0x1.1f59acp0,  0x1.306fep0, 0x1.428a3p0, 0x1.55b81p0, 0x1.6a09e6p0, 0x1.7f910ep0, 0x1.965feap0, 0x1.ae89fap0, 0x1.c823ep0, 0x1.e3437ep0};
    float pow5_12(float x){
        union{float f; uint32_t u;} v, e2;
        float poff, m, e, ei;
        int xe;
    
        v.f = x;
        xe = ((v.u >> 23) - 127);
    
        if(xe < -127) return(0.0f);
    
        // Calculate remainder k in 2^(k/12) to find LUT
        e = xe * (5.0f/12.0f);
        ei = floorf(e);
        poff = powk_12LUT[(int)(12.0f * (e - ei))];
    
        e2.u = ((int)ei + 127) << 23;   // Calculate the exponent
        v.u = (v.u & ~(0xFFuL << 23)) | (0x7FuL << 23); // Normalize exponent to zero
    
        // Approximate mx^(5/12) on [1,2), with appropriate degree minimax
        // m = 0x8.87592p-4 + v.f * (0x8.8f056p-4 + v.f * (-0x1.134044p-4));    // 7.6125e-4
        // m = 0x7.582138p-4 + v.f * (0xb.1666bp-4 + v.f * (-0x2.d21954p-4 + v.f * 0x6.3ea0cp-8));  // 8.4522726e-5
        m = 0x6.9465cp-4 + v.f * (0xd.43015p-4 + v.f * (-0x5.17b2a8p-4 + v.f * (0x1.6cb1f8p-4 + v.f * (-0x2.c5b76p-8))));   // 1.04091259e-5
        // m = 0x6.08242p-4 + v.f * (0xf.352bdp-4 + v.f * (-0x7.d0c1bp-4 + v.f * (0x3.4d153p-4 + v.f * (-0xc.f7a42p-8 + v.f * 0x1.5d840cp-8))));    // 1.367401e-6
    
        return(m * poff * e2.f);
    }
    

提交回复
热议问题