What's the difference between a single precision and double precision floating point operation?

没有蜡笔的小新

What is the difference between a single precision floating point operation and double precision floating operation?

I'm especially interested in practical terms in

11 Answers

攒了一身酷

2020-12-04 04:53

First of all, float and double are both used to represent fractional numbers. The difference between the two lies in how much precision they can store.

For example, suppose I have to store 123.456789. One type may be able to store only about 123.4567, while the other can store 123.456789 to all of its written digits.

So, basically, we want to know how accurately the number can be stored; that is what we call precision.

Quoting @Alessandro here

The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use.

Float can accurately store about 7 significant decimal digits, while double can accurately store about 15-16. These counts cover all the digits of the number, not just the fractional part.

So double can store roughly double the number of digits that float can. That is why double is said to have double the precision of float.
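As a rough illustration of the digit counts above, here is a minimal Python sketch. Python's own `float` is a 64-bit double, so single precision is emulated by round-tripping the value through the standard `struct` module; the helper name `to_float32` is made up for this example.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

value = 123.456789               # stored as a double
as_single = to_float32(value)    # the same number after rounding to single

# Printing many digits shows where the two representations start to differ.
print(f"double: {value:.15f}")
print(f"single: {as_single:.15f}")
print(f"error introduced by single precision: {abs(value - as_single):.3e}")
```

The single precision copy should agree with the original for only about the first seven significant digits.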

醉话见心

2020-12-04 04:54
Everyone has explained this in great detail and there is nothing I could add. Still, I would like to explain it in layman's terms, in plain English:
```
1.9 is less precise than 1.99
1.99 is less precise than 1.999
1.999 is less precise than 1.9999
```
and so on.

A variable able to store or represent 1.9 provides less precision than one able to hold or represent 1.9999. These fractions can amount to a huge difference in large calculations.
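To see how small representation errors can add up in a large calculation, the sketch below adds 0.1 one million times, once with every intermediate result rounded to single precision and once in double precision. The single precision arithmetic is emulated with `struct`, and the helper names are hypothetical.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_tenths(n: int, single: bool) -> float:
    """Add 0.1 to a running total n times, optionally in emulated single precision."""
    step = to_float32(0.1) if single else 0.1
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            # Round after every addition, as a 32-bit float variable would.
            total = to_float32(total)
    return total

n = 1_000_000
single_total = sum_tenths(n, single=True)
double_total = sum_tenths(n, single=False)

print("exact answer:    ", n / 10)
print("single precision:", single_total)
print("double precision:", double_total)
```

Neither total is exactly 100000, because 0.1 has no finite binary representation, but the single precision total drifts much further from it.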
广开言路

2020-12-04 04:55

According to IEEE 754, the standard for floating point storage:
- 32-bit and 64-bit formats (single precision and double precision)
- 8-bit and 11-bit exponents, respectively
- Extended formats (both mantissa and exponent) for intermediate results

无人共我

2020-12-04 04:56
I read a lot of answers but none seems to correctly explain where the word double comes from. I remember a very good explanation given by a University professor I had some years ago.

Recalling the style of VonC's answer, a single precision floating point representation uses a 32-bit word.
- 1 bit for the sign, 'S'
- 8 bits for the exponent, 'E'
- 24 bits for the fraction, also called mantissa or coefficient (even though only 23 are explicitly stored). Let's call it 'M' (for mantissa; I prefer this name because "fraction" can be misunderstood).
Representation:
```
          S  EEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMM
bits:    31 30      23 22                     0
```
(Just to point out, the sign bit is the highest-numbered bit, bit 31, not bit 0.)

A double precision floating point representation uses a 64-bit word.
- 1 bit for the sign, 'S'
- 11 bits for the exponent, 'E'
- 53 bits for the fraction / mantissa / coefficient (even though only 52 are explicitly stored), 'M'
Representation:
```
           S  EEEEEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
bits:     63 62         52 51                                                  0
```
As you may notice, I wrote that the mantissa has, in both types, one more bit of information than is actually stored. In fact, the mantissa is a number represented without all its non-significant zeros. For example,
- 0.000124 becomes 0.124 × 10⁻³
- 237.141 becomes 0.237141 × 10³
This means that the mantissa will always be in the form

0.α₁α₂...α_t × β^p

where β is the base of representation. But since the fraction is a binary number, α₁ will always be equal to 1, so the number can be rewritten as 1.α₂α₃...α_(t+1) × 2^(p-1) and the initial 1 can be implicitly assumed, making room for an extra bit (α_(t+1)).
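The implicit leading 1 can be checked by rebuilding a value from its stored fields. The sketch below does this for single precision. It assumes the IEEE 754 exponent bias of 127, which this answer does not mention, and it deliberately skips zero, subnormals, infinities and NaN, which follow different rules. The function name is hypothetical.

```python
import struct

def decode_float32(x: float) -> float:
    """Rebuild a normal 32-bit float from its sign, exponent and stored mantissa bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    stored = bits & 0x7FFFFF          # the 23 bits that are physically stored

    if exponent == 0 or exponent == 0xFF:
        raise ValueError("zero, subnormal, infinity and NaN are not handled in this sketch")

    # The leading 1 is not stored; adding it back gives 24 significant bits.
    significand = 1 + stored / 2**23
    # Assumption: IEEE 754 single precision stores the exponent with a bias of 127.
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)

print(decode_float32(0.15625))
print(decode_float32(-6.5))
```

Both inputs are exactly representable in single precision, so the rebuilt values should match them.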

Now, it's obviously true that 64 is twice 32, but that's not where the word comes from.

The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use.

With that said, it's easy to estimate the number of decimal digits which can be safely used:
- single precision: log₁₀(2²⁴) ≈ 7.22, which is about 7~8 decimal digits
- double precision: log₁₀(2⁵³) ≈ 15.95, which is about 15~16 decimal digits
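These two estimates are easy to reproduce. The short Python sketch below evaluates both logarithms and, as a cross-check, prints what the interpreter reports for its own `float` type, which is a double on typical IEEE 754 platforms.

```python
import math
import sys

# Decimal digits carried by a 24-bit and a 53-bit significand.
single_digits = math.log10(2**24)
double_digits = math.log10(2**53)

print(f"single precision: {single_digits:.2f} decimal digits")
print(f"double precision: {double_digits:.2f} decimal digits")

# Python's float is a C double; these report its significand width in bits
# and the number of decimal digits it is guaranteed to preserve.
print("significand bits:", sys.float_info.mant_dig)
print("guaranteed decimal digits:", sys.float_info.dig)
```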
無奈伤痛

2020-12-04 04:56

Double precision means the numbers take twice the word length to store. On a 32-bit processor, the words are all 32 bits, so doubles are 64 bits. In terms of performance, this means that operations on double precision numbers take a little longer to execute. So you get better range and precision, but there is a small performance hit. This hit is mitigated a little by hardware floating point units, but it's still there.

The N64 used a MIPS R4300i-based NEC VR4300, which is a 64-bit processor, but the processor communicates with the rest of the system over a 32-bit wide bus. So most developers used 32-bit numbers because they are faster, and most games at the time did not need the additional precision (so they used floats, not doubles).

All three systems can do single and double precision floating point operations, but they might not because of performance (although pretty much everything after the N64 used a 32-bit bus, so...).
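The storage half of that trade-off can be sketched in Python with the standard `struct` and `array` modules. This only shows size, not speed, and it assumes the usual platform where a C `float` is 4 bytes and a C `double` is 8 bytes.

```python
import struct
from array import array

# Size in bytes of one value in each IEEE 754 format ("<" selects standard sizes).
single_size = struct.calcsize("<f")
double_size = struct.calcsize("<d")
print("bytes per single:", single_size)
print("bytes per double:", double_size)

# The same one million values take twice the memory, and twice the bus traffic,
# when stored as doubles instead of singles.
n = 1_000_000
singles = array("f", [0.0]) * n
doubles = array("d", [0.0]) * n
single_bytes = len(singles) * singles.itemsize
double_bytes = len(doubles) * doubles.itemsize
print("array of singles:", single_bytes, "bytes")
print("array of doubles:", double_bytes, "bytes")
```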

失恋的感觉

2020-12-04 04:57

To add to all the wonderful answers here

First of all, float and double are both used to represent fractional numbers. The difference between the two lies in how much precision they can store.

For example, suppose I have to store 123.456789. One type may be able to store only about 123.4567, while the other can store 123.456789 to all of its written digits.

So, basically, we want to know how accurately the number can be stored; that is what we call precision.

Quoting @Alessandro here

The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use.

Float can accurately store about 7 significant decimal digits, while double can accurately store about 15-16. These counts cover all the digits of the number, not just the fractional part.

So double can store roughly double the number of digits that float can. That is why double is said to have double the precision of float.
