What precision are floating-point arithmetic operations done in?

倾然丶 夕夏残阳落幕 submitted on 2021-02-07 06:29:06

Question


Consider two very simple multiplications below:

double result1;
long double result2;
float var1=3.1;
float var2=6.789;
double var3=87.45;
double var4=234.987;

result1=var1*var2;
result2=var3*var4;

Are multiplications done by default in a higher precision than the operands? That is, is the first multiplication done in double precision, and is the second one done in 80-bit extended precision on the x86 architecture? Or should we cast the operands to the higher precision ourselves, as below?

result1=(double)var1*(double)var2;
result2=(long double)var3*(long double)var4;

What about other operations (addition, division and remainder)? For example, when adding more than two positive single-precision values, holding the intermediate results of the expression in double precision can reduce round-off error thanks to the extra significant bits.


Answer 1:


Precision of floating-point computations

C++11 incorporates the definition of FLT_EVAL_METHOD from C99 in the header <cfloat>.

FLT_EVAL_METHOD     

Possible values:
-1 undetermined
 0 evaluate just to the range and precision of the type
 1 evaluate float and double as double, and long double as long double.
 2 evaluate all as long double 
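
A minimal sketch of how one might check which of these values applies to a given build. The helper names describe_eval_method and current_eval_method are made up for illustration; only the FLT_EVAL_METHOD macro and the <cfloat> header come from the standard.

```cpp
#include <cfloat>   // FLT_EVAL_METHOD (available since C++11)

// Hypothetical helper: map a FLT_EVAL_METHOD value to the wording used above.
const char* describe_eval_method(int method) {
    switch (method) {
        case 0:  return "evaluate to the range and precision of the type";
        case 1:  return "float and double as double, long double as long double";
        case 2:  return "everything as long double";
        case -1: return "undetermined";
        default: return "implementation-defined";   // other negative values
    }
}

// Description for the compiler and flags this file is built with.
const char* current_eval_method() {
    return describe_eval_method(FLT_EVAL_METHOD);
}
```

Printing current_eval_method() from a test program shows what your toolchain does. The value can change with compiler options, so it is worth checking under the flags you actually ship with.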

If your compiler defines FLT_EVAL_METHOD as 2, then the computations of r1 and r2, and of s1 and s2 below are respectively equivalent:

double var3 = …;
double var4 = …;

double r1 = var3 * var4;
double r2 = (long double)var3 * (long double)var4;

long double s1 = var3 * var4;
long double s2 = (long double)var3 * (long double)var4;

If your compiler defines FLT_EVAL_METHOD as 2, then in all four computations above, the multiplication is done at the precision of the long double type.

However, if the compiler defines FLT_EVAL_METHOD as 0 or 1, r1 and r2 are not always the same, and neither are s1 and s2. The multiplications for r1 and s1 are done at the precision of double; the multiplications for r2 and s2 are done at the precision of long double.
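
To see whether the two forms differ on a particular platform, something like the following can be used. It is only a sketch, and the function name widening_changes_product is invented for this example.

```cpp
// Returns true if computing a * b with and without explicit long double
// conversions gives different long double results on this platform.
bool widening_changes_product(double a, double b) {
    long double s1 = a * b;                             // precision chosen by FLT_EVAL_METHOD
    long double s2 = (long double)a * (long double)b;   // always long double precision
    return s1 != s2;
}
```

With FLT_EVAL_METHOD equal to 2, or on platforms where long double has the same format as double, this should never return true. With FLT_EVAL_METHOD equal to 0 or 1 and a wider long double, it can return true whenever the exact product does not fit in a double.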

Getting wide results from narrow arguments

If you are computing results that are destined to be stored in a wider result type than the type of the operands, as are result1 and result2 in your question, you should always convert the arguments to a type at least as wide as the target, as you do here:

result2=(long double)var3*(long double)var4;

Without this conversion (if you write var3 * var4), if the compiler's definition of FLT_EVAL_METHOD is 0 or 1, the product will be computed in the precision of double, which is a shame, since it is destined to be stored in a long double.

If the compiler defines FLT_EVAL_METHOD as 2, then the conversions in (long double)var3*(long double)var4 are not necessary, but they do not hurt either: the expression means exactly the same thing with and without them.
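
The same advice applies to the float-to-double case from the question. The sketch below assumes float and double are IEEE 754 binary32 and binary64; the function names are invented for illustration.

```cpp
// Product of two floats computed in double, as recommended above.
// A binary32 significand has 24 bits, so the exact product needs at most
// 48 bits and fits in the 53-bit significand of binary64 without rounding.
double product_in_double(float a, float b) {
    return (double)a * (double)b;
}

// The same product without the conversions. With FLT_EVAL_METHOD == 0 the
// multiplication is rounded to float precision before it is widened.
double product_unconverted(float a, float b) {
    return a * b;
}
```

When FLT_EVAL_METHOD is 0, the two functions may return different values for inputs such as 3.1f and 6.789f, because only the first one keeps the bits that do not fit in a float.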

Digression: if the destination format is as narrow as the arguments, when is extended-precision for intermediate results better?

Paradoxically, for a single operation, rounding only once to the target precision is best. The only effect of computing a single multiplication in extended precision is that the result will be rounded to extended precision and then to double precision. This makes it less accurate. In other words, with FLT_EVAL_METHOD 0 or 1, the result r2 above is sometimes less accurate than r1 because of double-rounding, and if the compiler uses IEEE 754 floating-point, never better.
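
Double rounding is easier to see with small integers than with 64-bit significands. The sketch below is a toy model, not floating-point code: it drops low-order bits of an integer with round-half-to-even, once directly and once through an intermediate width. All function names are invented for this illustration.

```cpp
#include <cstdint>

// Round x / 2^k to the nearest integer, ties to even (k must be in 1..63).
std::uint64_t round_shift(std::uint64_t x, unsigned k) {
    std::uint64_t q    = x >> k;                              // truncated quotient
    std::uint64_t rem  = x & ((std::uint64_t{1} << k) - 1);   // discarded bits
    std::uint64_t half = std::uint64_t{1} << (k - 1);         // exactly one half
    if (rem > half || (rem == half && (q & 1))) {
        ++q;                                                  // round up
    }
    return q;
}

// Drop 4 bits with a single rounding (the analogue of rounding to double once).
std::uint64_t round_once(std::uint64_t x) {
    return round_shift(x, 4);
}

// Drop 2 bits, then 2 more (the analogue of extended precision, then double).
std::uint64_t round_twice(std::uint64_t x) {
    return round_shift(round_shift(x, 2), 2);
}
```

For x = 23 (binary 10111), round_once gives 1, since 23/16 = 1.4375. round_twice gives 2: the first step rounds 23/4 = 5.75 up to 6, which turns the second step into the tie 6/4 = 1.5, and the tie goes to the even value 2. The first rounding created a tie that was not present in the original value, which is exactly how double rounding loses accuracy.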

The situation is different for larger expressions that contain several operations. For these, it is usually better to compute intermediate results in extended precision, either through explicit conversions or because the compiler uses FLT_EVAL_METHOD == 2. This question and its accepted answer show that when computing with 80-bit extended precision intermediate computations for binary64 IEEE 754 arguments and results, the interpolation formula u2 * (1.0 - u1) + u1 * u3 always yields a result between u2 and u3 for u1 between 0 and 1. This property may not hold for binary64-precision intermediate computations because of the larger rounding errors then.
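
A sketch of the interpolation formula with the intermediate results explicitly forced to long double, so that it does not depend on FLT_EVAL_METHOD. The function name interpolate_wide is made up for this example.

```cpp
// Interpolation u2 * (1 - u1) + u1 * u3 with long double intermediates.
double interpolate_wide(double u1, double u2, double u3) {
    long double t = u1;
    long double r = (long double)u2 * (1.0L - t) + t * (long double)u3;
    return (double)r;   // round back to double only at the end
}
```

Note the assumption: the property described above relies on long double being the 80-bit extended format. On platforms where long double has the same format as double, the conversions change nothing and the guarantee does not carry over.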




Answer 2:


The usual arithmetic conversions for floating-point types are applied before multiplication, division, and modulus:

The usual arithmetic conversions are performed on the operands and determine the type of the result.

§5.6 [expr.mul]

Similarly for addition and subtraction:

The usual arithmetic conversions are performed for operands of arithmetic or enumeration type.

§5.7 [expr.add]

The usual arithmetic conversions for floating point types are laid out in the standard as follows:

Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield result types in a similar way. The purpose is to yield a common type, which is also the type of the result. This pattern is called the usual arithmetic conversions, which are defined as follows:

[...]

— If either operand is of type long double, the other shall be converted to long double.

— Otherwise, if either operand is double, the other shall be converted to double.

— Otherwise, if either operand is float, the other shall be converted to float.

§5 [expr]
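
These rules can be checked at compile time. The sketch below uses an alias named product_t, which is invented here; everything else is standard C++11.

```cpp
#include <type_traits>
#include <utility>

// Type of the expression a * b for operands of types A and B.
template <class A, class B>
using product_t = decltype(std::declval<A>() * std::declval<B>());

static_assert(std::is_same<product_t<float, float>, float>::value,
              "float * float has type float");
static_assert(std::is_same<product_t<float, double>, double>::value,
              "float * double has type double");
static_assert(std::is_same<product_t<double, long double>, long double>::value,
              "double * long double has type long double");
static_assert(std::is_same<product_t<int, float>, float>::value,
              "int * float has type float");
```

Keep in mind that these assertions describe the type of the result only. The standard allows floating-point operands and results to be held in greater precision and range than the type requires, which is what FLT_EVAL_METHOD reports.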

The actual form/precision of these floating point types is implementation-defined:

The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double. The value representation of floating-point types is implementation-defined.

§3.9.1 [basic.fundamental]
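
Since the representation is implementation-defined, it can be worth querying it instead of assuming it. A minimal sketch, where the helper name significand_digits is invented for this example:

```cpp
#include <limits>

// Number of radix digits in the significand of T. For the usual radix-2
// formats this is the number of significand bits.
template <class T>
constexpr int significand_digits() {
    return std::numeric_limits<T>::digits;
}
```

With IEEE 754 binary32 and binary64 this yields 24 for float and 53 for double. The value for long double varies: it is commonly 64 where long double is the x87 80-bit format and 53 where long double has the same format as double.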




Answer 3:


  1. For floating-point multiplication: FP multipliers internally use twice the width of the operands to generate an intermediate result that equals the exact (infinitely precise) product, and then round it to the target precision. Thus you should not worry about multiplication: the result is correctly rounded.
  2. For floating-point addition, the result is also correctly rounded, because standard FP adders use three extra guard bits, which is enough to compute a correctly rounded result.
  3. For division and remainder, IEEE 754 also requires a correctly rounded result (the remainder is in fact exact), so a conforming implementation behaves as in cases 1 and 2. For more complicated functions, such as the transcendentals sin, log, exp, etc., it depends mainly on the architecture and the libraries used. I recommend the MPFR library if you need correctly rounded results for such functions.
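
One way to look at the rounding error of a single multiplication is the fused multiply-add from <cmath>. This is a sketch under assumptions: it assumes IEEE 754 double arithmetic with no overflow or underflow, and the function name product_rounding_error is invented here.

```cpp
#include <cmath>   // std::fma (C++11)

// Returns a * b - round(a * b), the error committed when the product is
// rounded to double. std::fma evaluates a * b + c with a single rounding,
// and this particular difference is representable exactly in a double
// as long as nothing overflows or underflows.
double product_rounding_error(double a, double b) {
    double p = a * b;            // rounded product
    return std::fma(a, b, -p);   // exact product minus rounded product
}
```

For a correctly rounded product, the magnitude of the returned error is at most half a unit in the last place of the result, and it is zero when the exact product fits in a double, as for 3.0 * 4.0.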



Answer 4:


Not a direct answer to your question, but for constant floating-point values (such as the ones in your question), the method that loses the least precision is to represent each value as a rational number, an integer numerator divided by an integer denominator, and to perform as many integer multiplications as possible before the actual floating-point division.

For the floating-point values specified in your question:

int var1_num = 31;
int var1_den = 10;
int var2_num = 6789;
int var2_den = 1000;
int var3_num = 8745;
int var3_den = 100;
int var4_num = 234987;
int var4_den = 1000;
double result1 = (double)(var1_num*var2_num)/(var1_den*var2_den);
long double result2 = (long double)(var3_num*var4_num)/(var3_den*var4_den);

If any of the integer products is too large to fit in an int, then you can use a larger integer type:

unsigned int
signed   long
unsigned long
signed   long long
unsigned long long
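
The same idea wrapped in a function, as a sketch. The name rational_product is invented for this example, and long long is chosen here only so that the products from the example above cannot overflow.

```cpp
// Product (n1 / d1) * (n2 / d2): exact integer products first, then a
// single floating-point division. The conversions to double are exact only
// while the magnitudes of num and den stay below 2^53.
double rational_product(long long n1, long long d1, long long n2, long long d2) {
    long long num = n1 * n2;   // exact, assuming no integer overflow
    long long den = d1 * d2;   // exact, assuming no integer overflow
    return (double)num / (double)den;
}
```

For the first pair of values in the question, rational_product(31, 10, 6789, 1000) divides 210459 by 10000, so the only rounding happens in that one division.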


Source: https://stackoverflow.com/questions/25302480/what-precision-are-floating-point-arithmetic-operations-done-in
