Double-precision operations: 32-bit vs 64-bit machines

Question

Why don't we see twice the performance when executing 64-bit operations (e.g. double-precision operations) on a 64-bit machine, compared to executing them on a 32-bit machine?

In a 32-bit machine, don't we need to fetch from memory twice as much? More importantly, don't we need twice as many cycles to execute a 64-bit operation?

Answer 1:

“64-bit machine” is an ambiguous term, but it usually means that the processor's general-purpose registers are 64 bits wide. Compare the 8086 and the 8088: they have the same instruction set and can both be called 16-bit processors in this sense, even though the 8088 has only an 8-bit external data bus.

When the phrase is used in this sense, it says nothing about the width of the memory bus, the width of the internal buses inside the CPU, or the ability of the ALU to operate efficiently on 32- or 64-bit data.

Your question also assumes that the hardest part of a multiplication is moving the operands to the multiplication unit inside the processor. That would not be quite true even if the operands came from memory over a 32-bit bus, because latency != throughput. Also, as far as the mathematics of floating-point multiplication goes, a double-precision multiplication is not twice as hard as a single-precision one; it is roughly (53/24)² times as hard, since the significands are 53 and 24 bits wide. But, again, the transistors can be there to compute the double-precision multiplication efficiently regardless of the width of the general-purpose registers.
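To put a number on the (53/24)² figure, here is a small sketch. It assumes a schoolbook multiplier whose cost grows with the square of the significand width, which is a simplification; real hardware multipliers are organized differently. The function name `multiplier_cost_ratio` is just for illustration.

```python
import sys

# Significand widths, counting the implicit leading bit.
SINGLE_SIGNIFICAND_BITS = 24                       # IEEE 754 binary32
DOUBLE_SIGNIFICAND_BITS = sys.float_info.mant_dig  # 53 on IEEE 754 binary64 platforms


def multiplier_cost_ratio(wide_bits, narrow_bits):
    """Relative cost of an n-by-n-bit schoolbook multiply, assumed to grow as n**2."""
    return (wide_bits / narrow_bits) ** 2


ratio = multiplier_cost_ratio(DOUBLE_SIGNIFICAND_BITS, SINGLE_SIGNIFICAND_BITS)
print(f"{ratio:.2f}")  # prints 4.88
```

So under that model the double-precision multiplier is closer to five times the work, not two, and none of it depends on how wide the general-purpose registers are.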

Answer 2:

In a 32-bit machine, don't we need to fetch from memory twice as much?

No. In most modern CPUs, the memory data bus is at least 64 bits wide, and newer microarchitectures may have wider buses. Quad-channel memory gives a combined CPU-RAM bus width of at least 256 bits. So you need only one fetch to get a double. Besides, most of the time the value is already in cache, so loading it won't take much time.
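As a back-of-the-envelope sketch of that point: the number of bus transfers depends on the bus width, not on the register width. The bus widths below are illustrative values, and the model ignores caches and burst transfers entirely; `transfers_needed` is a made-up helper.

```python
import math
import struct

# Size of an IEEE 754 double in bits ("<d" uses the standard 8-byte size).
DOUBLE_BITS = struct.calcsize("<d") * 8


def transfers_needed(operand_bits, bus_bits):
    """Number of bus transfers needed to move one operand over a bus of the given width."""
    return math.ceil(operand_bits / bus_bits)


for bus_bits in (32, 64, 256):
    print(bus_bits, transfers_needed(DOUBLE_BITS, bus_bits))
# prints:
# 32 2
# 64 1
# 256 1
```

Only a bus that is actually 32 bits wide would need two transfers, and a "32-bit machine" in the register-width sense does not imply such a bus.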

More importantly, don't we need twice as many cycles to execute a 64-bit operation?

First, you should know that a double has only 53 significant bits, so the operation is not simply "twice as hard".
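If you want to see where the 53 comes from, here is a minimal sketch that splits a double into its IEEE 754 fields: 1 sign bit, 11 exponent bits, and 52 stored fraction bits, with the 53rd significant bit being implicit. It assumes Python's `float` is an IEEE 754 binary64 value, which is the case on common platforms; `double_fields` is a name chosen for this example.

```python
import struct


def double_fields(x):
    """Return (sign, biased_exponent, fraction) of a float viewed as IEEE 754 binary64."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)    # 52 stored bits; the leading 1 is implicit
    return sign, exponent, fraction


print(double_fields(1.0))   # prints (0, 1023, 0)
print(double_fields(-2.5))  # prints (1, 1024, 1125899906842624)
```

The last value is 2**50, i.e. the fraction 0.25 of the significand 1.25, since -2.5 = -1.25 × 2¹.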

To operate on those floating-point values you need to load them into registers. Once they're loaded, performance will not differ as long as the ALU can do double-precision math in one instruction. On architectures that allow a memory operand, like x86, if the value is already in cache it makes almost no difference compared to operating on registers.

With SSE2/AVX/AVX-512 the ALU may even be able to process 2/4/8 doubles at a time, so a single double is not much work for it. In the old x87, the internal registers are 80 bits wide and both single- and double-precision values are extended to 80 bits, so their performance will also be the same.
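The 2/4/8 figures follow directly from the vector register widths (128-bit XMM, 256-bit YMM, 512-bit ZMM) divided by the 64 bits of a double. A trivial sketch, with `doubles_per_register` as an illustrative helper:

```python
DOUBLE_BITS = 64

# Vector register widths in bits for each instruction-set extension.
VECTOR_REGISTER_BITS = {"SSE2": 128, "AVX": 256, "AVX-512": 512}


def doubles_per_register(register_bits, element_bits=DOUBLE_BITS):
    """How many elements of the given width fit in one vector register."""
    return register_bits // element_bits


lanes = {name: doubles_per_register(bits) for name, bits in VECTOR_REGISTER_BITS.items()}
print(lanes)  # prints {'SSE2': 2, 'AVX': 4, 'AVX-512': 8}
```

This only counts lanes per register; how many such operations a given core can issue per cycle depends on the microarchitecture.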

Source: https://stackoverflow.com/questions/28297228/double-precision-operations-32-bit-vs-64-bit-machines

Tags

performance

memory

32bit-64bit

cpu-registers

cpu-architecture