Safe Floating Point Division

Submitted by 允我心安 on 2019-12-22 11:38:11

Question


I have some places in my code where I want to ensure that a division of two arbitrary floating-point numbers (32-bit single precision) won't overflow. The target/compiler does not guarantee (explicitly enough) sane handling of -INF/INF, does not fully guarantee IEEE 754 behavior for the exceptional values (which are possibly undefined), and the target might change. I also cannot make safe assumptions about the inputs in these few special places, and I am bound to the C90 standard libraries.

I have read What Every Computer Scientist Should Know About Floating-Point Arithmetic but to be honest, I am a little bit lost.

So... I want to ask the community whether the following piece of code would do the trick, and whether there are better, faster, more exact, or more correct ways to do it:

#define SIGN_F(val) (((val) >= 0.0f) ? 1.0f : -1.0f)

float32_t safedivf(float32_t num, float32_t denum)
{
   const float32_t abs_denum = fabs(denum);
   if ((abs_denum < 1.0f) && ((abs_denum * FLT_MAX) <= (float32_t)fabs(num)))
       return SIGN_F(denum) * SIGN_F(num) * FLT_MAX;
   else
       return num / denum;
}

Edit: Changed ((abs_denum * FLT_MAX) < (float32_t)fabs(num)) to ((abs_denum * FLT_MAX) <= (float32_t)fabs(num)) as recommended by Pascal Cuoq.


Answer 1:


In ((abs_denum * FLT_MAX) < (float32_t)fabs(num)), the product abs_denum * FLT_MAX may round down and end up equal to fabs(num). Equality in the rounded comparison therefore does not rule out that num / denum is mathematically larger than FLT_MAX, so the division could still cause the overflow that you want to avoid. You had better replace this < with <=.


For an alternative solution, if a double type is available and is wider than float, it may be more economical to compute (double)num/(double)denum. If float is binary32ish and double is binary64ish, the only way the double division can overflow is if denum is a zero (and if denum is a zero, your code is also problematic).

double dbl_res = (double)num/(double)denum;
float res = dbl_res < -FLT_MAX ? -FLT_MAX : dbl_res > FLT_MAX ? FLT_MAX : (float)dbl_res;
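As a rough sketch, the two lines above could be wrapped into a function that also treats a zero denominator explicitly. The name safedivf_wide is made up for illustration, and the sketch assumes that double really is wider than float on the target. It ignores the sign of a zero denominator and returns a signed FLT_MAX for any x / 0, including 0 / 0, which mirrors what the code in the question does for a zero denominator.

```c
#include <float.h>

/* Hypothetical name. Assumes double has a wider exponent range than float. */
float safedivf_wide(float num, float denom)
{
    double q;

    if (denom == 0.0f) {
        /* Zero denominator: saturate. The sign of the zero is ignored here. */
        return (num < 0.0f) ? -FLT_MAX : FLT_MAX;
    }

    /* With a nonzero float denominator this cannot overflow a wider double. */
    q = (double)num / (double)denom;

    if (q > FLT_MAX) {
        return FLT_MAX;
    }
    if (q < -FLT_MAX) {
        return -FLT_MAX;
    }
    return (float)q;
}
```

For example, safedivf_wide(FLT_MAX, 0.5f) saturates to FLT_MAX, while safedivf_wide(1.0f, 2.0f) is an ordinary division.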



Answer 2:


You may try to extract the exponents and the mantissas of num and denum, and test for the overflow condition:

((exp(num) - exp(denum)) > max_exp) && (mantissa(num) >= mantissa(denum))

If it holds, generate the corresponding INF according to the signs of the inputs.




Answer 3:


Carefully work with num, denom when the quotient is near FLT_MAX.

The following uses tests inspired by the OP but stays away from results near FLT_MAX. As @Pascal Cuoq points out, rounding may just push the result over the edge. Instead, it uses thresholds of FLT_MAX/FLT_RADIX and FLT_MAX*FLT_RADIX.

By scaling with FLT_RADIX, typically 2, the code should always get exact results. Rounding under any rounding mode is not expected to affect the result.

In terms of speed, the "happy path", that is, when the result certainly does not overflow, should be a fast calculation. Unit testing is still needed, but the comments should convey the gist of this approach.

static int SD_Sign(float x) {
  if (x > 0.0f)
    return 1;
  if (x < 0.0f)
    return -1;
  if (atan2f(x, -1.0f) > 0.0f)
    return 1;
  return -1;
}

static float SD_Overflow(float num, float denom) {
  return SD_Sign(num) * SD_Sign(denom) * FLT_MAX;
}

float safedivf(float num, float denom) {
  float abs_denom = fabsf(denom);
  // If |quotient| > |num|
  if (abs_denom < 1.0f) {
    float abs_num = fabsf(num);
    // If |num/denom| > FLT_MAX/2 --> quotient is very large or overflows
    // This computation is safe from rounding and overflow.
    if (abs_num > FLT_MAX / FLT_RADIX * abs_denom) {
      // If |num/denom| >= FLT_MAX*2 --> overflow
      // This also catches denom == 0.0
      if (abs_num / FLT_RADIX >= FLT_MAX * abs_denom) {
        return SD_Overflow(num, denom);
      }
      // At this point, quotient must be in or near range FLT_MAX/2 to FLT_MAX*2
      // Scale parameters so quotient is a FLT_RADIX * FLT_RADIX factor smaller.
      if (abs_num > 1.0) {
        abs_num /= FLT_RADIX * FLT_RADIX;
      } else {
        abs_denom *= FLT_RADIX * FLT_RADIX;
      }
      float quotient = abs_num / abs_denom;
      if (quotient > FLT_MAX / (FLT_RADIX * FLT_RADIX)) {
        return SD_Overflow(num, denom);
      }
    }
  }
  return num / denom;
}

SIGN_F() needs to consider whether denum is +0.0 or -0.0. Various methods were mentioned by @Pascal Cuoq in a comment:

  1. copysign() or signbit()
  2. Use a union

Additionally, some functions, when well implemented, distinguish between +0.0 and -0.0, such as atan2f(zero, -1.0) and sprintf(buffer, "%+f", zero).
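The sprintf() idea can be sketched with C90 library calls only. The helper name sign_with_zero is hypothetical, and the sketch relies on the assumption that the C library prints a minus sign for -0.0, which the standard library is not required to do on every target.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 for positive values and +0.0,
 * -1 for negative values and -0.0.
 * Assumes sprintf("%+f") prints "-0.000000" for negative zero. */
int sign_with_zero(float x)
{
    char buf[64];

    if (x > 0.0f) {
        return 1;
    }
    if (x < 0.0f) {
        return -1;
    }
    /* Only zeros (and NaN) reach this point, so the output is short. */
    sprintf(buf, "%+f", (double)x);
    return (buf[0] == '-') ? -1 : 1;
}
```

A target whose library does not distinguish the zeros would always report 1 here, so this check is worth unit testing on each platform.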

Note: Used float vs. float32_t for simplicity.
Note: Maybe use fabsf() rather than fabs().
Minor: Suggest denom (denominator) in lieu of denum.




Answer 4:


To avoid the corner cases with rounding and whatnot, you could massage the exponent of the divisor (with frexp() and ldexp()) and worry about whether the result can be scaled back without overflow. Or apply frexp() to both arguments and do the exponent work by hand.
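One possible reading of the "frexp() both arguments" variant is sketched below. The names safedivf_frexp and sat_result are invented for illustration. The sketch assumes finite inputs and FLT_RADIX == 2, ignores the sign of a zero denominator, and saturates to a signed FLT_MAX instead of producing INF. Its rounding may differ from a plain num / denom in rare cases, especially for subnormal results.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper: the saturated result with the requested sign. */
static float sat_result(int negative)
{
    return negative ? -FLT_MAX : FLT_MAX;
}

/* Hypothetical name. Assumes finite inputs and FLT_RADIX == 2. */
float safedivf_frexp(float num, float denom)
{
    int e_num, e_den, e_adj, e_q;
    int negative = (num < 0.0f) != (denom < 0.0f);
    double m_num, m_den, m_q;
    float m_f;

    if (denom == 0.0f) {
        return sat_result(num < 0.0f);  /* sign of the zero is ignored */
    }
    if (num == 0.0f) {
        return num / denom;             /* 0 / nonzero cannot overflow */
    }

    /* num = m_num * 2^e_num and denom = m_den * 2^e_den, 0.5 <= |m| < 1 */
    m_num = frexp((double)num, &e_num);
    m_den = frexp((double)denom, &e_den);

    /* 0.5 < |m_num / m_den| < 2, so this division cannot overflow. */
    m_q = frexp(m_num / m_den, &e_adj);
    e_q = e_num - e_den + e_adj;

    /* Round the mantissa to float precision; it may round up to 1.0. */
    m_f = (float)m_q;
    if (m_f >= 1.0f || m_f <= -1.0f) {
        m_f *= 0.5f;
        e_q += 1;
    }

    /* |m_f| < 1, so the result fits whenever e_q <= FLT_MAX_EXP. */
    if (e_q > FLT_MAX_EXP) {
        return sat_result(negative);
    }
    return (float)ldexp((double)m_f, e_q);
}
```

The overflow decision is made on integer exponents only, so no intermediate floating-point value ever exceeds the float range.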



Source: https://stackoverflow.com/questions/25310051/safe-floating-point-division
