How are these double precision values accurate to 20 decimals?

泪湿孤枕 submitted on 2020-01-04 02:26:07

Question


I am testing some very simple equivalence errors when precision is an issue. The plan was to perform the operations in extended double precision (so that I would know the answer to ~19 digits) and then perform the same operations in double precision (where roundoff error appears around the 16th digit), but somehow my double precision arithmetic is maintaining 19 digits of accuracy.

When I perform the operations in extended double, then hardcode the numbers into another Fortran routine, I get the expected errors, but is there something strange going on when I assign an extended double precision variable to a double precision variable here?

program code_gen
    implicit none 
    integer, parameter :: Edp = selected_real_kind(17)
    integer, parameter :: dp = selected_real_kind(8)
    real(kind=Edp) :: alpha10, x10, y10, z10 
    real(kind=dp) :: alpha8, x8, y8, z8

    real(kind = dp) :: pi_dp = 3.1415926535897932384626433832795028841971693993751058209749445

    integer :: iter
    integer :: niters = 10

    print*, 'tiny(x10) = ', tiny(x10)
    print*, 'tiny(x8)  = ', tiny(x8)
    print*, 'epsilon(x10) = ', epsilon(x10)
    print*, 'epsilon(x8)  = ', epsilon(x8)

    do iter = 1,niters
        x10 = rand()
        y10 = rand()
        z10 = rand()
        alpha10 = x10*(y10+z10)

        x8 = x10 
        x8 = x8 - pi_dp
        x8 = x8 + pi_dp
        y8 = y10 
        y8 = y8 - pi_dp
        y8 = y8 + pi_dp
        z8 = z10 
        z8 = z8 - pi_dp
        z8 = z8 + pi_dp
        alpha8 = alpha10

        write(*, '(a, es30.20)') 'alpha8 .... ', x8*(y8+z8)
        write(*, '(a, es30.20)') 'alpha10 ... ', alpha10

        if( alpha8 .gt. x8*(y8+z8) ) then
            write(*, '(a)') 'ERROR(.gt.)'
        elseif( alpha8 .lt. x8*(y8+z8) ) then
            write(*, '(a)') 'ERROR(.lt.)'
        endif
    enddo
end program code_gen

where rand() is the gfortran intrinsic extension (the question links to its gfortran documentation).

If we are speaking about only one precision type (say, double), then we can denote machine epsilon as E16, which is approximately 2.22E-16. If we take a simple addition of two real numbers, x+y, then the resulting machine-represented number is (x+y)*(1+d1) where abs(d1) < E16. Likewise, if we then multiply that number by z, the resulting value is really (z*((x+y)*(1+d1)))*(1+d2), which is nearly z*(x+y)*(1+d1+d2) where abs(d1+d2) < 2*E16. If we now move to extended double precision, the only thing that changes is that E16 becomes E20, with a value of around 1.08E-19.
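As a quick, concrete check of that model, here is a minimal sketch (the program name and the values are purely illustrative, and it assumes gfortran-style kinds where selected_real_kind(17) gives 80-bit extended precision):

program error_model
    implicit none
    integer, parameter :: Edp = selected_real_kind(17)
    integer, parameter :: dp  = selected_real_kind(8)
    real(kind=dp)  :: x8, y8, z8, alpha8
    real(kind=Edp) :: alpha10, relerr

    ! Values chosen so that both y8+z8 and the multiply have to round.
    x8 = 1.0_dp/3.0_dp
    y8 = 1.0_dp/7.0_dp
    z8 = 1.0_dp/11.0_dp

    alpha8  = x8*(y8 + z8)                                   ! two roundings at double precision
    alpha10 = real(x8, Edp)*(real(y8, Edp) + real(z8, Edp))  ! (almost) exact reference

    relerr = abs(alpha10 - alpha8)/abs(alpha10)
    print '(a, es12.5)', 'relative error: ', relerr
    print '(a, es12.5)', '2*E16 bound   : ', 2*epsilon(x8)   ! the error stays below this bound
end program error_model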

My hope was to perform the analysis in extended double precision so that I could compare two numbers which should be equal, but show that, on occasion, roundoff error causes the comparisons to fail. By assigning x8=x10, I was hoping to create a double precision 'version' of the extended double precision value x10, where only the first ~16 digits of x8 match those of x10; but upon printing the values, all 20 digits are the same and the expected double precision roundoff error does not appear.

It should also be noted that before this attempt, I wrote a program which actually writes another program where the values of x, y, and z are 'hardcoded' to 20 decimal places. In this version of the program, the comparisons of .gt. and .lt. failed as expected, but I am not able to duplicate the same failures by casting an extended double precision value as a double precision variable.

In an attempt to further 'perturb' the double precision values and add roundoff error, I have subtracted, then added back, pi from my double precision variables, which should leave them carrying some double precision roundoff error, but I am still not seeing that in the final result.


Answer 1:


As the gfortran documentation you link states, the function result of rand is a default real value (single precision). Such a value can be represented exactly by each of your other real types.

That is, x10=rand() assigns a single precision value to the extended precision variable x10. It does so exactly. This same value now stored in x10 is assigned to the double precision variable x8, but this remains exactly representable as double precision.

Because the stored value has only single precision's worth of significant bits, there is enough precision to spare that the calculations using the double and extended types return the same value. [See the note at the end of this answer.]
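To see the exact-representation point in isolation, here is a minimal sketch (the program name and the value 0.1 are just for illustration):

program single_exact
    implicit none
    integer, parameter :: Edp = selected_real_kind(17)
    integer, parameter :: dp  = selected_real_kind(8)
    real           :: s
    real(kind=dp)  :: d
    real(kind=Edp) :: e

    ! A default (single precision) real has at most 24 significand bits, so
    ! widening it to double or extended precision changes nothing: the extra
    ! trailing bits are simply zero.
    s = 0.1
    d = s
    e = s
    print '(a, es30.20)', 'single   ... ', s
    print '(a, es30.20)', 'double   ... ', d
    print '(a, es30.20)', 'extended ... ', e   ! all three lines show the same digits
end program single_exact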

If you wish to see real effects of loss of precision, then start by using an extended or double precision value. For example, rather than using rand (returning a single precision value), use the intrinsic random_number

call random_number(x10)

(which has the advantage of being standard Fortran). Unlike a function, which in (nearly) all cases returns a value of a fixed kind regardless of how that value is then used, this subroutine gives you a precision corresponding to its argument. You will (hopefully) see much the same as you did from your "hard-coded" experiment.

Alternatively, as agentp commented, it may be more intuitive to start with a double precision value

call random_number(x8); x10=x8   ! x8 and x10 have the precision of double precision
call random_number(y8); y10=y8
call random_number(z8); z10=z8

and perform the calculations from that starting point: those extra bits will then start to show.
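As a sketch of how the loop body in the question might look with this approach (it reuses the declarations from the question's program; the final comparison line is only illustrative):

do iter = 1, niters
    ! Generate double precision values, then widen them (exactly) to extended.
    call random_number(x8); x10 = x8
    call random_number(y8); y10 = y8
    call random_number(z8); z10 = z8

    alpha8  = x8*(y8 + z8)       ! rounded at double precision
    alpha10 = x10*(y10 + z10)    ! (almost) exact reference at extended precision

    write(*, '(a, es30.20)') 'alpha8  ... ', alpha8
    write(*, '(a, es30.20)') 'alpha10 ... ', alpha10
    if (real(alpha10, dp) /= alpha8) write(*, '(a)') 'roundoff visible'
enddo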

In summary, when you do x8=x10 the bits of x8 do correspond to the leading bits of x10, but because the value originated as single precision, the trailing bits of x8 and everything beyond them in x10 are all zero anyway.

When it comes to your pi_dp perturbation, you are again assigning a single precision value (this time a literal constant) to a double precision variable. Just having all those digits doesn't make the constant anything other than a default real literal. You can specify the kind of the literal with a suffix such as _dp (or _Edp), as described in other answers.
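For example, a sketch of the two declarations with kind suffixes on the literal (pi_Edp is a name I've made up here; the digit string is the one from the question, and dp/Edp are the question's kind parameters):

! Without a kind suffix the constant is a default (single precision) real,
! however many digits are written. With the suffix it is a constant of the
! named kind, rounded once to that kind's precision.
real(kind=dp)  :: pi_dp  = 3.1415926535897932384626433832795028841971693993751058209749445_dp
real(kind=Edp) :: pi_Edp = 3.1415926535897932384626433832795028841971693993751058209749445_Edp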

Finally, one also has to worry about what the compiler does with regard to optimization.


My thesis is that, starting from the single precision value, the calculations performed are representable exactly in both double and extended precision (with the same values). For other calculations, for a starting point with more bits set, or for other representations (on some systems, or with other compilers, the numeric type with kind selected_real_kind(17) may have quite different characteristics, such as a different radix), that needn't be the case.

This was, though, largely based on guessing and hoping it explained the observation. Fortunately, there are ways to test the idea. As we're talking about IEEE arithmetic we can consider the inexact flag: if that flag isn't raised during the computation, we can be happy.

With gfortran there is the compilation option -ffpe-trap=inexact, which turns signalling of the inexact exception into a trap (so the program stops at the first inexact operation). With gfortran 5.0 the intrinsic module ieee_exceptions is supported, which can be used in a portable/standard manner to test the flag.

You can consider this flag for further experimentation: if it is raised then you can expect to see differences between the two precisions.
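As a minimal sketch of the standard route (gfortran 5 or later; the computation here is only a stand-in for the one you actually care about):

program check_inexact
    use, intrinsic :: ieee_exceptions
    implicit none
    integer, parameter :: dp = selected_real_kind(8)
    real(kind=dp) :: x, y
    logical :: raised

    call random_number(x)                      ! a runtime value, so nothing is folded at compile time

    call ieee_set_flag(ieee_inexact, .false.)  ! clear the flag before the computation of interest
    y = x/3.0_dp                               ! division by 3 is (almost always) inexact
    call ieee_get_flag(ieee_inexact, raised)

    if (raised) then
        print '(a)', 'inexact was signalled: the result was rounded'
    else
        print '(a)', 'the computation was exact'
    end if
end program check_inexact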



Source: https://stackoverflow.com/questions/34639858/how-are-these-double-precision-values-accurate-to-20-decimals
