floating-accuracy | 易学教程

F# - How to compare floats

Question: In F#, how can I efficiently compare floats that are almost equal? It should work for very large and very small values too. I am thinking of first comparing the exponent and then the significand (mantissa) while ignoring the last 4 of its 52 bits. Is that a good approach? How can I get the exponent and significand of a float? Answer 1: An F# float is just shorthand for System.Double. That being the case, you can use the BitConverter.DoubleToInt64Bits method to efficiently ...
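The bit-reinterpretation idea from the answer can be sketched in Python, where `struct` plays the role of `BitConverter.DoubleToInt64Bits`. The helper names `ordered_bits` and `almost_equal` and the default of 16 units in the last place (roughly "ignore the last 4 bits") are assumptions made for illustration; they do not come from the original answer.

```python
import math
import struct

def ordered_bits(x: float) -> int:
    """Map a double to an integer whose ordering matches the float ordering."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative doubles are stored as sign + magnitude; negate the magnitude
    # so that the mapping is monotonic and -0.0 maps to the same value as 0.0.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def almost_equal(a: float, b: float, max_ulps: int = 16) -> bool:
    """True if a and b are at most max_ulps representable doubles apart."""
    if math.isnan(a) or math.isnan(b):
        return False          # NaN never compares equal to anything
    if a == b:
        return True           # covers equal infinities and +0.0 / -0.0
    if math.isinf(a) or math.isinf(b):
        return False          # an infinity is not "close" to a finite value
    return abs(ordered_bits(a) - ordered_bits(b)) <= max_ulps

print(almost_equal(0.1 + 0.2, 0.3))   # True: the two values are 1 ULP apart
print(almost_equal(1.0, 1.0000001))   # False: far more than 16 ULPs apart
```

Because the distance is counted in representable values rather than as an absolute difference, the same threshold behaves consistently for very large and very small magnitudes.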

Implementation details of fmt.Println in golang

Question: Consider this code: import ( "fmt" "math/big" ) func main() { var b1,b2,b3,bigSum big.Float b1.SetFloat64(25.3) b2.SetFloat64(76.2) b1.SetFloat64(53.1) bigSum.Add(&b1, &b2).Add(&b3, &bigSum) fmt.Println(bigSum) // {53 0 0 1 false [9317046909104082944] 8} fmt.Println(&bigSum) // 129.3 } I have two questions. Why do I have to pass bigSum by reference (using &) to get the correct answer, when otherwise we get back an object? How does Println work in Go? I mean, how does it know which format it should ...

Division and floating points

Question: Can anyone help me understand why x2 prints zero? I guess x1 is rounded off because of the floating-point representation; is there a way to keep the precision? long double x1, x2; x1 = 0.087912088; // Note: 360/4095 = 0.087912088 x2 = 360/4095; printf("%Lf, %Lf \n", x1, x2); Result: x1 = 0.087912, x2 = 0.000000. Answer 1: The problem is integer truncation: you are dividing two integers, so the result is another integer with the fractional part thrown away. (So, for instance, in the case when the real result of an ...
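The truncation described in the answer can be reproduced in Python, used here only as a stand-in for the C snippet: `//` behaves like C's `/` on two non-negative `int` operands, while `/` behaves like the division you get once at least one operand is a floating-point value (for example `360.0L/4095` in C). The variable names are illustrative.

```python
# Integer division discards the fractional part, as C does for int / int.
int_quotient = 360 // 4095

# True division keeps the fraction, as C does once an operand is floating point.
real_quotient = 360 / 4095

print(int_quotient)             # 0
print(f"{real_quotient:.9f}")   # 0.087912088
```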

Does Fortran have inherent limitations on numerical accuracy compared to other languages?

While working on a simple programming exercise, I produced a while loop (DO loop in Fortran) that was meant to exit when a real variable had reached a precise value. I noticed that, due to the precision being used, the equality was never met and the loop became infinite. This is, of course, not unheard of, and one is advised that, rather than comparing two numbers for equality, it is best to see if the absolute difference between two numbers is less than a set threshold. What I found disappointing was how low I had to set this threshold, even with variables at double precision, for my loop to exit ...
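The loop problem described above is not specific to Fortran; a minimal Python sketch shows the same effect with IEEE 754 doubles. The function name `count_steps`, the step of 0.1, and the relative tolerance of 1e-9 are assumptions chosen for illustration, not values from the original question.

```python
import math

def count_steps(target: float = 1.0, step: float = 0.1, rel_tol: float = 1e-9):
    """Add step repeatedly until x is close to target (or has passed it)."""
    x = 0.0
    steps = 0
    # The 'x < target' guard keeps the loop finite even if the tolerance
    # test is never satisfied.
    while not math.isclose(x, target, rel_tol=rel_tol) and x < target:
        x += step
        steps += 1
    return steps, x

steps, x = count_steps()
print(steps)      # 10
print(x == 1.0)   # False: ten additions of 0.1 give 0.9999999999999999
```

An exact `x == 1.0` test would never fire here, whereas the tolerance test exits after the expected ten iterations.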

Why does 0.1 + 0.4 = 0.5?

We know that floating point is broken, because decimal numbers can't always be perfectly represented in binary. They're rounded to a number that can be represented in binary; sometimes that number is higher, and sometimes it's lower. In this case, using the ubiquitous IEEE 754 double format, both 0.1 and 0.4 round higher: 0.1 = 0.1000000000000000055511151231257827021181583404541015625 and 0.4 = 0.40000000000000002220446049250313080847263336181640625. Since both of these numbers are high, you'd expect their sum to be high as well. Perfect addition should give you 0 ...
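The two expansions quoted above can be checked directly in Python, whose `float` is an IEEE 754 double. The sketch below also computes the exact sum of the two stored values with `fractions.Fraction` and compares the excess over 0.5 with half the spacing between adjacent doubles; the variable names are illustrative.

```python
from decimal import Decimal
from fractions import Fraction

a, b = 0.1, 0.4

# Decimal(float) converts the stored binary value exactly, with no rounding.
print(Decimal(a))  # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(b))  # 0.40000000000000002220446049250313080847263336181640625

# Fractions add exactly, so this is the "perfect" sum of the two stored values.
exact_sum = Fraction(a) + Fraction(b)
excess = exact_sum - Fraction(1, 2)

# Doubles in [0.5, 1) are spaced 2**-53 apart, so an excess below half of
# that spacing rounds back down to 0.5.
half_spacing = Fraction(1, 2**54)
print(excess > 0, excess < half_spacing)  # True True
print(a + b == 0.5)                        # True
```

The exact sum is indeed slightly above 0.5, but by less than half a spacing, so the rounded result of the addition is exactly 0.5.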

c++ floating point precision loss: 3015/0.00025298219406977296

Question: The problem: Microsoft Visual C++ 2005 compiler, 32-bit Windows XP SP3, AMD 64 X2 CPU. Code: double a = 3015.0; double b = 0.00025298219406977296; //*((unsigned __int64*)(&a)) == 0x40a78e0000000000 //*((unsigned __int64*)(&b)) == 0x3f30945640000000 double f = a/b;//3015/0.00025298219406977296; The result of the calculation (i.e. "f") is 11917835.000000000 (*((unsigned __int64*)(&f)) == 0x4166bb4160000000), although it should be 11917834.814763514 (i.e. *((unsigned __int64*)(&f)) == ...

Why does (int)(33.46639 * 1000000) return 33466389?

(int)(33.46639 * 1000000) returns 33466389. Why does this happen? Floating-point math isn't perfect; see "What every programmer should know about it". Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising because floating-point is ubiquitous in computer systems. Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions ...
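The same result appears in Python, whose `float` is also an IEEE 754 double, so it can serve as a quick check of the behaviour in the title. The sketch assumes that truncation toward zero (what a cast to int does) is the cause and shows rounding to nearest as one possible alternative; `product` is an illustrative name.

```python
from decimal import Decimal

# The literal 33.46639 is stored as the nearest double, which is slightly
# smaller than the decimal value written in the source code.
print(Decimal(33.46639) < Decimal("33.46639"))  # True

product = 33.46639 * 1000000

print(int(product))    # 33466389: int() truncates toward zero
print(round(product))  # 33466390: rounding to nearest recovers the expected value
```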

Payne Hanek algorithm implementation in C

I'm struggling to understand how TO IMPLEMENT the range reduction algorithm published by Payne and Hanek (range reduction for trigonometric functions). I've seen there's this library: http://www.netlib.org/fdlibm/ But it looks so twisted to me, and all the theoretical explanations I've found are too simple to provide an implementation. Is there some good... good... good explanation of it? Performing argument reduction for trigonometric functions via the Payne-Hanek algorithm is actually pretty straightforward. As with other argument reduction schemes, compute n = round_nearest (x / (π/2)) , ...

Is there a floating point value of x, for which x-x == 0 is false?

Question: In most cases, I understand that a floating-point comparison test should be implemented over a range of values (abs(x-y) < epsilon), but does self-subtraction imply that the result will be zero? // can the assertion be triggered? float x = //?; assert( x-x == 0 ) My guess is that nan/inf might be special cases, but I'm more interested in what happens for simple values. Edit: I'm happy to pick an answer if someone can cite a reference (IEEE floating point standard). Answer 1: As you hinted, ...
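The question's guess about NaN and infinity can be checked with a short Python sketch (Python's `float` is an IEEE 754 double, whereas the question uses a C `float`; the special-value behaviour shown is the same). The helper name `self_difference_is_zero` is hypothetical.

```python
import math

def self_difference_is_zero(x: float) -> bool:
    """Evaluate the assertion from the question: x - x == 0."""
    return x - x == 0

# Finite values, including subnormals and the largest double, give exactly 0.
for value in (0.0, -0.0, 1.5, 5e-324, 1.7976931348623157e308):
    print(value, self_difference_is_zero(value))   # True for each value

# inf - inf and nan - nan are both NaN, and NaN compares unequal to everything.
print(self_difference_is_zero(math.inf))   # False
print(self_difference_is_zero(math.nan))   # False
```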

How to safely floor or ceil a CGFloat to int?

Question: I often need to floor or ceil a CGFloat to an int, for calculating an array index. The problem I constantly see with floorf(theCGFloat) or ceilf(theCGFloat) is that floating-point inaccuracies can cause trouble. So what if my CGFloat is 2.0f but internally it is represented as 1.999999999999f or something like that? I do floorf and get 1.0f, which is a float again. And yet I must cast this beast to int, which may introduce another problem. Is there a best practice how to floor ...
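One possible approach to the concern raised in the question, sketched in Python rather than Objective-C, is to nudge the value by a small tolerance before flooring or ceiling. The helper names `floor_index` and `ceil_index` and the tolerance of 1e-6 are assumptions made for illustration; the excerpt is cut off before any answer, so this is not presented as the recommended practice.

```python
import math

def floor_index(x: float, eps: float = 1e-6) -> int:
    """Floor x, treating values within eps below an integer as that integer."""
    return math.floor(x + eps)   # math.floor already returns an int

def ceil_index(x: float, eps: float = 1e-6) -> int:
    """Ceil x, treating values within eps above an integer as that integer."""
    return math.ceil(x - eps)    # math.ceil already returns an int

print(floor_index(1.999999999))  # 2: treated as 2.0 despite the tiny shortfall
print(floor_index(1.5))          # 1
print(ceil_index(2.000000001))   # 2: treated as 2.0 despite the tiny excess
print(ceil_index(2.5))           # 3
```

The tolerance has to be chosen for the data at hand: it must be larger than the accumulated rounding error but smaller than any fractional part that should count as genuine.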