ieee-754 | 易学教程

How does float guarantee 7 digit precision?

Question: As far as I know, a single-precision floating-point number has 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa. I can understand that 7-digit integers fit in the 23-bit mantissa without losing precision, but I can't understand how a number like 1234567000000000 fits without losing the digits "1,2,3,4,5,6,7". What is the math behind this?

Answer 1: The IEEE-754 basic 32-bit binary floating-point format only guarantees that six significant decimal digits will survive a round-trip conversion, not seven.
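
The six-digit guarantee can be probed with a small C sketch. The helper name `survives_round_trip` is hypothetical (it is not from the answer), and the sketch assumes `float` is IEEE-754 binary32 and that the C library rounds correctly in `strtof` and `snprintf`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `decimal`, stored in a float and printed
 * back with `digits` significant digits, matches the original value printed
 * the same way. Assumes float is IEEE-754 binary32. */
int survives_round_trip(const char *decimal, int digits)
{
    char before[64];
    char after[64];
    double reference;
    float narrowed;

    assert(digits >= 1 && digits <= 9);

    reference = strtod(decimal, NULL); /* value at double precision */
    narrowed = strtof(decimal, NULL);  /* nearest single-precision value */

    /* "%.*e" prints one digit before the point and digits - 1 after it. */
    snprintf(before, sizeof before, "%.*e", digits - 1, reference);
    snprintf(after, sizeof after, "%.*e", digits - 1, (double)narrowed);
    return strcmp(before, after) == 0;
}
```

Under these assumptions, `survives_round_trip("1234567000000000", 7)` returns 1, because the nearest float lies within 2^26 of the value, far less than half a unit in the seventh digit. `survives_round_trip("8.589973e9", 7)` returns 0: the nearest float is 8589973504, which prints as 8.589974e+09. That is one of the seven-digit inputs the format cannot preserve.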

Is IEEE 754-2008 deterministic?

Question: If I start with the same values and perform the same primitive operations (addition, multiplication, comparison etc.) on double-precision 64-bit IEEE 754-2008 values, will I get the same result, independent of the underlying machine? More concretely: since ECMAScript 2015 specifies that a Number value is a primitive value corresponding to a double-precision 64-bit binary format IEEE 754-2008 value, can I conclude that the same operations yield the same result here, independent of the

Convert 64 bit hexadecimal to float in PHP

Question: I'm trying to convert a 64-bit hexadecimal number to a float in PHP: 40F82C719999999A. If I run that through the IEEE-754 Floating-Point Conversion page at http://babbage.cs.qc.cuny.edu/IEEE-754.old/64bit.html it converts to 99015.100000000000, which is the number I'm looking for. But I can't get to this number in PHP. I've tried various combinations of pack() and unpack(), but I'm not anywhere close. :(

Answer 1: function hex2float($strHex) { $hex = sscanf($strHex, "%02x%02x%02x%02x%02x%02x%02x
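
The PHP answer above is cut off, so here is the underlying idea as a minimal C sketch instead of PHP: parse the 16 hex digits into a 64-bit integer, then reinterpret those bits as a double. The name `hex_to_double` is hypothetical, and the sketch assumes `double` is IEEE-754 binary64 with the same byte order as 64-bit integers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reinterpret a 16-digit hex string as a binary64 value.
 * Assumes double is IEEE-754 binary64 and shares the byte order of uint64_t. */
double hex_to_double(const char *hex)
{
    uint64_t bits;
    double value;

    assert(sizeof value == sizeof bits);

    bits = strtoull(hex, NULL, 16);      /* "40F82C719999999A" -> integer */
    memcpy(&value, &bits, sizeof value); /* copy the bit pattern, no numeric conversion */
    return value;
}
```

With the value from the question, `hex_to_double("40F82C719999999A")` yields the double closest to 99015.1: the exponent field 0x40F is 1039, giving 2^16 after removing the bias of 1023, and the fraction field supplies the remaining digits.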

sine cosine modular extended precision arithmetic

Question: In many implementations of sine/cosine I've seen so-called extended modular precision arithmetic. But what is it for? For instance, in the Cephes implementation, after reduction to the range [0, pi/4], they perform this modular precision arithmetic to improve the precision. Here is the code: z = ((x - y * DP1) - y * DP2) - y * DP3; where DP1, DP2 and DP3 are hardcoded coefficients. How can those coefficients be found mathematically? I've understood the purpose of "modular extension arithmetic"
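
A minimal C sketch of the reduction step quoted in the question. The three constants are the double-precision values as I recall them from the Cephes sine routine, so treat them as an assumption and check them against the Cephes source. The function name `reduce_by_pi_over_4` is hypothetical.

```c
#include <assert.h>

/* pi/4 split into three pieces with DP1 + DP2 + DP3 ~= pi/4.
 * Values recalled from Cephes (assumption; verify against the source).
 * DP1 is pi/4 truncated to 24 significant bits: 13176794 / 2^24. */
static const double DP1 = 7.85398125648498535156E-1;
static const double DP2 = 3.77489470793079817668E-8;
static const double DP3 = 2.69515142907905952645E-15;

/* Hypothetical helper: computes x - y * (pi/4), where y is the integer
 * multiple of pi/4 to remove. Subtracting the pieces from largest to
 * smallest keeps the cancellation error small. */
double reduce_by_pi_over_4(double x, double y)
{
    /* With y below 2^29, y * DP1 needs at most 53 bits and is exact. */
    assert(y >= 0.0 && y < 536870912.0);
    return ((x - y * DP1) - y * DP2) - y * DP3;
}
```

One way to obtain such coefficients, under the assumption above: take pi/4 to more than double precision, keep only its leading 24 bits as DP1, keep the leading bits of the remainder pi/4 - DP1 as DP2, and store what is left as DP3. Because DP1 has trailing zero bits, the largest product `y * DP1` is exact, and the rounding errors are confined to the much smaller terms.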

extract bits from 32 bit float numbers in C

Question: 32-bit floats are represented in binary using the IEEE format. So how can I extract those bits? Bitwise operations like & and | do not work on them! What I basically want to do is extract the LSB from 32-bit float images in OpenCV. Thanks in advance!

Answer 1: uint32_t get_float_bits(float f) { assert(sizeof(float) == sizeof(uint32_t)); // or static assert uint32_t bits; memcpy(&bits, &f, sizeof f); return bits; } As of C99, the standard guarantees that the union trick works (provided the sizes match),
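
Building on the `memcpy` approach from the answer, the following sketch pulls out the individual fields, including the least significant bit the question asks about. The helper is repeated here so the block stands alone. The `float_*` names are hypothetical, and the field layout assumes IEEE-754 binary32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the raw bit pattern of a float into an integer (as in the answer). */
uint32_t get_float_bits(float f)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &f, sizeof f);
    return bits;
}

/* Hypothetical field accessors, assuming the binary32 layout:
 * 1 sign bit, 8 exponent bits, 23 fraction bits. */
uint32_t float_sign(float f)     { return get_float_bits(f) >> 31; }
uint32_t float_exponent(float f) { return (get_float_bits(f) >> 23) & 0xFFu; }
uint32_t float_fraction(float f) { return get_float_bits(f) & 0x7FFFFFu; }

/* Least significant bit of the stored pattern. */
uint32_t float_lsb(float f)      { return get_float_bits(f) & 1u; }
```

For example, `get_float_bits(1.0f)` is `0x3F800000`, so its biased exponent is 127 and its LSB is 0, while `0.1f` is stored as `0x3DCCCCCD` and has an LSB of 1.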

Converting floating point to unsigned int while preserving order

Question: I have found a lot of answers on SO that focus on converting float to int. I am manipulating only positive floating-point values. One simple method I have been using is this: unsigned int float2ui(float arg0) { float f = arg0; unsigned int r = *(unsigned int*)&f; return r; } The above code works well, yet it fails to preserve the numeric order. By order I mean this: float f1 ...; float f2 ...; assert( ( (f1 >= f2) && (float2ui(f1) >= float2ui(f2)) ) || ( (f1 < f2) && (float2ui(f1) < vfloat2ui
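
A commonly used order-preserving mapping, sketched in C. It is not taken from the (truncated) thread. The name `float_to_ordered_u32` is hypothetical, and the sketch assumes IEEE-754 binary32 and ignores NaN. It uses `memcpy` instead of the pointer cast in the question, which violates C's aliasing rules.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: map a float to an unsigned key such that
 * a < b implies key(a) < key(b) for all non-NaN values.
 * Assumes float is IEEE-754 binary32. */
uint32_t float_to_ordered_u32(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);

    /* Negative values: invert every bit, so larger magnitudes sort lower.
     * Non-negative values: set the sign bit, so they sort above negatives. */
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}
```

For non-negative, non-NaN inputs the raw bit pattern already increases with the value, so the transform matters mainly once negative numbers appear. Note that -0.0f and +0.0f receive adjacent but different keys, although they compare equal as floats.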

Converting IEEE 754 from bit stream into float in JavaScript

Question: I have serialized a 32-bit floating-point number using the Go function math.Float32bits, which returns the IEEE 754 binary representation of the number. This value is then serialized as a 32-bit integer and read into JavaScript as a byte array. For example, here is an actual number: float: 2.8088086, as byte array: 40 33 c3 85, as hex: 0x4033c385. There is a demo converter that displays the same numbers. I need to get that same floating-point number back from the byte array in
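
The decoding logic, sketched in C rather than JavaScript: assemble the four bytes most-significant first, as in the question's example, then reinterpret the 32-bit pattern as a float. Both function names are hypothetical, and the sketch assumes `float` is IEEE-754 binary32.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: combine four bytes, most significant first. */
uint32_t be_bytes_to_u32(const unsigned char *bytes)
{
    return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
}

/* Hypothetical helper: reinterpret a 32-bit pattern as a float.
 * Assumes float is IEEE-754 binary32. */
float float_from_bits(uint32_t bits)
{
    float value;
    assert(sizeof value == sizeof bits);
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

With the bytes 40 33 c3 85 from the question, `be_bytes_to_u32` gives `0x4033c385`, and `float_from_bits` of that pattern gives approximately 2.8088086: the exponent field 0x80 means a scale of 2^1, and the fraction 0x33c385 / 2^23 adds about 0.4044043 to the implicit leading 1.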
