ieee-754

How does float guarantee 7 digit precision?

与世无争的帅哥 posted on 2021-02-20 04:23:25
Question: As far as I know, a single-precision floating-point number has 1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa. I can understand that 7-digit integers fit in a 23-bit mantissa and don't lose precision, but I can't understand how a number like 1234567000000000 fits without losing the digits "1,2,3,4,5,6,7". What is the math behind this?
Answer 1: The IEEE-754 basic 32-bit binary floating-point format only guarantees that six significant decimal digits will survive a round-trip conversion, not seven.
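To see both effects concretely, here is a minimal Python sketch (Python is used only for illustration; `to_float32` is a hypothetical helper, not part of the original answer). It rounds a value to the nearest binary32 number through `struct`, shows that 1234567000000000 is stored only approximately while its leading digits stay intact, and shows a 7-digit decimal that does not survive the round trip.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest IEEE-754 binary32 value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# The number from the question is not stored exactly. Neighbouring binary32
# values near 1.2e15 are 2**27 apart, so the stored value can be off by up
# to 2**26. That error is far too small to disturb the leading digits.
big = 1234567000000000.0
stored = to_float32(big)
print(stored, abs(stored - big))
print("%.7g" % stored)  # leading seven significant digits of the stored value

# Seven digits are not guaranteed in general: between 2**33 and 2**34 the
# binary32 spacing is 1024, wider than the spacing of 1000 between
# neighbouring 7-digit decimals, so some of those decimals must collide.
seven_digits = 8.589973e9
print("%.7g" % to_float32(seven_digits))

# The same region is still safe at six significant digits.
six_digits = 8.58997e9
print("%.6g" % to_float32(six_digits))
```

The exponent supplies the scale (here 2**50), so the 24-bit significand only has to carry the leading digits; the trailing zeros of the decimal number cost nothing.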

Is IEEE 754-2008 deterministic?

拥有回忆 posted on 2021-02-19 01:31:06
Question: If I start with the same values and perform the same primitive operations (addition, multiplication, comparison, etc.) on double-precision 64-bit IEEE 754-2008 values, will I get the same result, independent of the underlying machine? More concretely: since ECMAScript 2015 specifies that a Number value is a primitive value corresponding to a double-precision 64-bit binary format IEEE 754-2008 value, can I conclude that the same operations yield the same result here, independent of the

Convert 64 bit hexadecimal to float in PHP

随声附和 posted on 2021-02-18 17:08:48
Question: I'm trying to convert a 64-bit hexadecimal number to a float in PHP: 40F82C719999999A. If I run that through the IEEE-754 Floating-Point Conversion page at http://babbage.cs.qc.cuny.edu/IEEE-754.old/64bit.html it converts to 99015.100000000000, which is the number I'm looking for. But I can't get to this number in PHP. I've tried various combinations of pack() and unpack() but I'm not anywhere close. :(
Answer 1: function hex2float($strHex) { $hex = sscanf($strHex, "%02x%02x%02x%02x%02x%02x%02x
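The answer above is cut off, so as an illustration only, here is a minimal Python sketch of the same conversion (not the PHP answer itself). It assumes the 16 hex digits are the big-endian bytes of an IEEE-754 binary64 value; `hex_to_double` is a hypothetical name.

```python
import struct

def hex_to_double(hex_str: str) -> float:
    """Interpret 16 hex digits as a big-endian IEEE-754 binary64 value."""
    raw = bytes.fromhex(hex_str)
    if len(raw) != 8:
        raise ValueError("expected exactly 16 hex digits")
    # ">d" means: big-endian, 8-byte double.
    return struct.unpack(">d", raw)[0]

# The value from the question; the converter page reports about 99015.1.
print(hex_to_double("40F82C719999999A"))
```

The same two steps (hex digits to raw bytes, raw bytes reinterpreted as a double) apply in any language; only the byte-order handling differs.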

sine cosine modular extended precision arithmetic

给你一囗甜甜゛ posted on 2021-02-18 07:27:52
Question: In many implementations of sine/cosine I've seen so-called extended-precision modular arithmetic. But what is it for? For instance, in the Cephes implementation, after reduction to the range [0, pi/4], they use this modular-precision arithmetic to improve the precision. Here is the code: z = ((x - y * DP1) - y * DP2) - y * DP3; where DP1, DP2 and DP3 are hardcoded coefficients. How can those coefficients be found mathematically? I've understood the purpose of "modular extension arithmetic"
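One plausible way to obtain such coefficients is sketched below in Python. This is an assumption about the general technique (usually called Cody-Waite range reduction), not a reconstruction of how the Cephes constants were actually produced. The idea is to split pi/4 into pieces whose leading parts have short mantissas, so that the products y * DP1 and y * DP2 are exact for moderately sized integer y. The helper names and the choice of 24 bits per piece are hypothetical.

```python
import math
from fractions import Fraction

# pi to 50 decimal places, used only as a high-precision reference.
PI = Fraction("3.14159265358979323846264338327950288419716939937510")
PI_4 = PI / 4

def chop(v: float, bits: int) -> float:
    """Keep only the leading `bits` significant bits of a positive float."""
    m, e = math.frexp(v)  # v == m * 2**e with 0.5 <= m < 1
    return math.ldexp(math.floor(m * 2**bits) / 2**bits, e)

# Each piece is taken from what the previous pieces left over.
DP1 = chop(float(PI_4), 24)
DP2 = chop(float(PI_4 - Fraction(DP1)), 24)
DP3 = float(PI_4 - Fraction(DP1) - Fraction(DP2))

def reduce_split(x: float) -> float:
    """Subtract y * pi/4 in three steps, as in the expression quoted above."""
    y = float(math.floor(x / (math.pi / 4)))
    return ((x - y * DP1) - y * DP2) - y * DP3

def reduce_naive(x: float) -> float:
    """Subtract y * pi/4 in one step with the double-precision constant."""
    y = float(math.floor(x / (math.pi / 4)))
    return x - y * (math.pi / 4)

def exact(x: float) -> Fraction:
    """Reference result computed with rational arithmetic."""
    y = math.floor(x / (math.pi / 4))
    return Fraction(x) - y * PI_4

x = 1000.0
print("split error:", float(abs(Fraction(reduce_split(x)) - exact(x))))
print("naive error:", float(abs(Fraction(reduce_naive(x)) - exact(x))))
```

Because DP1 and DP2 carry only 24 significant bits each, multiplying them by an integer of up to about 29 bits still fits in a 53-bit double, so the first two subtractions introduce no rounding error and only the small y * DP3 term is rounded.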

extract bits from 32 bit float numbers in C

筅森魡賤 posted on 2021-02-08 10:36:30
Question: 32-bit floats are represented in binary using the IEEE format. So how can I extract those bits? Bitwise operations like & and | do not work on them! What I basically want to do is extract the LSB from 32-bit float images in OpenCV. Thanks in advance!
Answer 1:
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    uint32_t get_float_bits(float f) {
        assert(sizeof(float) == sizeof(uint32_t)); // or static assert
        uint32_t bits;
        memcpy(&bits, &f, sizeof f); // copy the object representation
        return bits;
    }
As of C99, the standard guarantees that the union trick works (provided the sizes match),
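For quick experiments outside C, the same reinterpretation can be sketched in Python with `struct`. This is an illustrative sketch only; `float_bits` and `split_fields` are hypothetical names, and the field widths are the standard binary32 layout (1 sign bit, 8 exponent bits, 23 mantissa bits).

```python
import struct

def float_bits(f: float) -> int:
    """Return the 32-bit pattern of f after rounding it to binary32."""
    return struct.unpack("<I", struct.pack("<f", f))[0]

def split_fields(bits: int):
    """Split a binary32 bit pattern into (sign, biased exponent, mantissa)."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

bits = float_bits(1.0)
# bits & 1 is the least significant bit of the mantissa.
print(hex(bits), split_fields(bits), bits & 1)
```

Once the value is held as an unsigned integer, the ordinary bitwise operators apply, which is the point of the `memcpy` approach in the answer.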

Converting floating point to unsigned int while preserving order

三世轮回 posted on 2021-02-08 03:01:50
Question: I have found a lot of answers on SO focusing on converting float to int. I am manipulating only positive floating-point values. One simple method I have been using is this:
    unsigned int float2ui(float arg0) {
        float f = arg0;
        unsigned int r = *(unsigned int*)&f;
        return r;
    }
The above code works well, yet it fails to preserve the numeric order. By order I mean this:
    float f1 ...; float f2 ...;
    assert( ( (f1 >= f2) && (float2ui(f1) >= float2ui(f2)) ) || ( (f1 < f2) && (float2ui(f1) < vfloat2ui
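As background for this question: in IEEE-754 binary32 the raw bit patterns of non-negative, non-NaN floats already compare in the same order as the values, and a widely used trick extends this to negative values. The Python sketch below illustrates that trick; it is not taken from the thread, and `ordered_key` is a hypothetical name. NaNs are deliberately left out.

```python
import struct

def float_bits(f: float) -> int:
    """Return the 32-bit pattern of f after rounding it to binary32."""
    return struct.unpack("<I", struct.pack("<f", f))[0]

def ordered_key(f: float) -> int:
    """Map a non-NaN binary32 value to an unsigned key with the same order."""
    bits = float_bits(f)
    if bits & 0x80000000:
        # Negative: invert every bit so larger magnitudes get smaller keys.
        return ~bits & 0xFFFFFFFF
    # Non-negative: set the sign bit so these keys sit above all negatives.
    return bits | 0x80000000

values = [3.5, -0.25, 65536.0, 0.0, -1024.0, 1.0, 0.25, -3.5]
print(sorted(values, key=ordered_key) == sorted(values))
```

For inputs that are known to be positive, the unmodified bit pattern is already a valid ordering key, so the extra mapping only matters once negative values can appear.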

Converting IEEE 754 from bit stream into float in JavaScript

给你一囗甜甜゛ posted on 2021-02-06 13:57:03
Question: I have serialized a 32-bit floating-point number using the Go function math.Float32bits, which returns the IEEE 754 binary representation of the floating-point number. This number is then serialized as a 32-bit integer and read into JavaScript as a byte array. For example, here is an actual number:
float: 2.8088086
as byte array: 40 33 c3 85
as hex: 0x4033c385
There is a demo converter that displays the same numbers. I need to get that same floating-point number back from the byte array in
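The decoding itself is language-independent, so here is a minimal sketch of it in Python rather than JavaScript (the question text is cut off, so the target environment's exact constraints are unknown). It assumes the four bytes arrive in big-endian order, as in the example above, and decodes the sign, exponent and fraction by hand; `decode_float32` is a hypothetical name.

```python
import math
import struct

def decode_float32(raw: bytes) -> float:
    """Decode 4 big-endian bytes as an IEEE-754 binary32 value by hand."""
    if len(raw) != 4:
        raise ValueError("expected exactly 4 bytes")
    bits = int.from_bytes(raw, "big")
    sign = -1.0 if bits >> 31 else 1.0
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:
        # All-ones exponent encodes infinity (fraction 0) or NaN.
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1.
        return sign * math.ldexp(fraction, -149)
    # Normal number: implicit leading 1, exponent bias 127, 23 fraction bits.
    return sign * math.ldexp(0x800000 | fraction, exponent - 150)

raw = bytes.fromhex("4033c385")
# Compare the manual decoding with the library's own interpretation.
print(decode_float32(raw), struct.unpack(">f", raw)[0])
```

In JavaScript the same arithmetic can be written with bit shifts and `Math.pow`, or the bytes can be placed in an `ArrayBuffer` and read back with `DataView.prototype.getFloat32`, which reads big-endian by default.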
