Why can't uint64_t show pow(2, 64) - 1 properly?

Question

I'm trying to understand why the uint64_t type cannot show pow(2, 64) - 1 properly. My compiler reports __cplusplus as 199711L.

I checked the pow() overloads in the C++98 standard, which are:

double pow(double base, double exponent);
float pow(float base, float exponent);
long double pow(long double base, long double exponent);
double pow(double base, int exponent);
long double pow(long double base, int exponent);

So I wrote the following snippet:

double max1 = (pow(2, 64) - 1);
cout << max1 << endl;

uint64_t max2 = (pow(2, 64) - 1);
cout << max2 << endl;

uint64_t max3 = -1;
cout << max3 << endl;

The outputs are:

max1: 1.84467e+019
max2: 9223372036854775808
max3: 18446744073709551615

Answer 1:

Floating point numbers have finite precision.

On your system (and typically, assuming binary64 IEEE-754 format) 18446744073709551615 is not a number that has a representation in the double format. The closest number that does have a representation happens to be 18446744073709551616.

Adding or subtracting two floating point numbers of wildly different magnitudes usually produces a rounding error, and that error can be significant relative to the smaller operand. In the case of 18446744073709551616. - 1. -> 18446744073709551616., the error of the subtraction is 1, which is the entire value of the smaller operand.

When a floating point value is converted to an integer type and the truncated value cannot fit in that integer type, the behaviour of the program is undefined, even when the integer type is unsigned. The value printed for max2 is therefore not something the standard guarantees.

Answer 2:

pow(2, 64) - 1 is a double expression, not an int one, because pow has no overload that returns an integral type. The literal 1 is converted to double to match the type of pow's result.

However, because IEEE-754 double precision is only 64 bits long in total (with a 53-bit significand), it can never store values that need 64 significant bits or more, such as 2⁶⁴-1.

See also the related questions "64-bit unsigned integers which cannot map onto a double" and "Are all integer values perfectly represented as doubles?"

So pow(2, 64) - 1 is rounded to the closest representable value, which is pow(2, 64) itself, and pow(2, 64) - 1 == pow(2, 64) evaluates to true (printed as 1). The largest representable value smaller than that is 18446744073709549568 = 2⁶⁴ - 2048. You can check this with std::nextafter (available since C++11).

On some platforms (notably x86, except on MSVC) long double does have 64 bits of significand, so you'll get the correct value in that case. The following snippet

double max1 = pow(2, 64) - 1;
std::cout << "pow(2, 64) - 1 = " << std::fixed << max1 << '\n';
std::cout << "Previous representable value: " << std::nextafter(max1, 0) << '\n';
std::cout << (pow(2, 64) - 1 == pow(2, 64)) << '\n';

long double max2 = pow(2.0L, 64) - 1.0L;
std::cout << std::fixed << max2 << '\n';

prints out

pow(2, 64) - 1 = 18446744073709551616.000000
Previous representable value: 18446744073709549568.000000
1
18446744073709551615.000000

On many other platforms long double may be IEEE-754 quadruple-precision or double-double. Both have more than 64 bits of significand, so the same approach works there, although the overhead will be higher.

In any case, you shouldn't use a floating-point type for integer math in the first place. Not only is pow(2, x) far slower to calculate than 1ULL << x, it also causes the issue you saw because of the limited precision of double. Use uint64_t max2 = -1 instead, or ((unsigned __int128)1ULL << 64) - 1 if the compiler supports that type.

Source: https://stackoverflow.com/questions/54948430/why-uint64-t-cannot-show-pow2-64-1-properly

Tags

c++

floating-point

double

uint64