Our C++ library currently uses time_t for storing time values. I'm beginning to need sub-second precision in some places, so a larger data type will be necessary there anyway.
More than you ever wanted to know about doing 64-bit math in 32-bit mode...
When you use 64-bit numbers in 32-bit mode (even on a 64-bit CPU, if the code is compiled for 32-bit), they are stored as two separate 32-bit numbers, one holding the higher bits and the other holding the lower bits. The impact of this depends on the instruction. (tl;dr: in theory, 64-bit math on a 32-bit CPU is about 2 times slower, as long as you don't divide or take a modulo; in practice the difference is likely to be smaller, 1.3x would be my guess, because programs don't usually do nothing but 64-bit integer math, and because of pipelining the difference in your program may be much smaller still.)
Many architectures support a so-called carry flag. It's set when the result of an addition overflows, or when a subtraction needs a borrow. The behaviour of this flag can be shown with long addition and long subtraction. In this example, C shows either a bit just above the highest representable bit (during the operation) or the carry flag (after the operation).
```
  C 7 6 5 4 3 2 1 0        C 7 6 5 4 3 2 1 0
  0 1 1 1 1 1 1 1 1        1 0 0 0 0 0 0 0 0
+   0 0 0 0 0 0 0 1      -   0 0 0 0 0 0 0 1
= 1 0 0 0 0 0 0 0 0      = 0 1 1 1 1 1 1 1 1
```
Why is the carry flag relevant? Well, it just so happens that CPUs usually have two separate addition and subtraction operations. In x86, the addition operations are called `add` and `adc`. `add` stands for addition, while `adc` stands for addition with carry. The difference between them is that `adc` also considers the carry bit, and if it is set, it adds one to the result.
Similarly, x86's subtraction with borrow (`sbb`) subtracts an extra 1 from the result if the carry flag is set (after a subtraction, a set carry flag on x86 means a borrow occurred).
This behaviour makes it easy to implement addition and subtraction on integers of arbitrary size. The result of adding x and y (assuming both are 8-bit) is never bigger than `0x1FE`. Add an incoming carry of 1 and you get at most `0x1FF`. Nine bits are therefore enough to represent the result of any 8-bit addition. If you start the addition with `add`, and then add all the bits beyond the initial ones with `adc`, you can do addition on data of any size you like.
Addition of two 64-bit values on a 32-bit CPU is then one `add` for the low 32 bits, followed by one `adc` for the high 32 bits.
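The same idea can be sketched in portable C++, treating a 64-bit value as a pair of 32-bit halves (the `u64parts` and `add64` names are mine, purely for illustration):

```cpp
#include <cstdint>

// A 64-bit value stored as two 32-bit halves, the way a 32-bit CPU sees it.
struct u64parts {
    uint32_t lo;
    uint32_t hi;
};

// Mirrors the add/adc pair: add the low halves, detect the carry-out,
// then feed that carry into the addition of the high halves.
u64parts add64(u64parts a, u64parts b) {
    u64parts r;
    r.lo = a.lo + b.lo;            // like add
    uint32_t carry = r.lo < a.lo;  // unsigned wraparound means a carry occurred
    r.hi = a.hi + b.hi + carry;    // like adc
    return r;
}
```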
Subtraction works analogously, with `sub` followed by `sbb`.
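A matching subtraction sketch, mirroring the `sub`/`sbb` pair (again, the struct and helper names are hypothetical):

```cpp
#include <cstdint>

struct u64parts {
    uint32_t lo;
    uint32_t hi;
};

// Mirrors the sub/sbb pair: subtract the low halves, detect the borrow,
// then subtract it from the difference of the high halves.
u64parts sub64(u64parts a, u64parts b) {
    u64parts r;
    r.lo = a.lo - b.lo;             // like sub
    uint32_t borrow = a.lo < b.lo;  // a borrow happened if the subtrahend was larger
    r.hi = a.hi - b.hi - borrow;    // like sbb
    return r;
}
```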
This gives 2 instructions; however, because of instruction pipelining, it may effectively be slower than two independent instructions, as the second addition depends on the result of the first. If the CPU has nothing else to do besides the 64-bit addition, it has to wait for the first addition to finish.
It so happens that on x86, `mul` and `imul` can be used in such a way that the overflow (the high half of the product) is stored in the `edx` register. Therefore, multiplying two 32-bit values to get a 64-bit value is really easy. Such a multiplication is one instruction, but to make use of it, one of the operands must be stored in `eax`.
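In C or C++ you get this single-instruction multiply simply by widening one operand before multiplying (the function name here is my own):

```cpp
#include <cstdint>

// On 32-bit x86 this compiles down to a single mul instruction: the low half
// of the product ends up in eax, the high half (the "overflow") in edx.
uint64_t mul32to64(uint32_t a, uint32_t b) {
    return static_cast<uint64_t>(a) * b;
}
```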
Anyway, for the more general case of multiplying two 64-bit values, the result can be calculated piecewise from 32-bit halves (in what follows, assume a function r that keeps only the lowest 32 bits of its argument).
First of all, it's easy to notice that the lower 32 bits of the result are just the product of the lower 32 bits of the operands. This is due to the congruence relation:
a1 ≡ b1 (mod n)
a2 ≡ b2 (mod n)
a1a2 ≡ b1b2 (mod n)
Therefore, the task is limited to just determining the higher 32 bits. To calculate them, the following values should be added together, each reduced mod 2^32: the upper 32 bits of the product r(a) · r(b), plus r(a) · (b ≫ 32), plus (a ≫ 32) · r(b).
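Putting this together in C++ (a sketch only: the truncating casts and unsigned wraparound play the role of r, and `mul64` is my own name):

```cpp
#include <cstdint>

// 64x64 -> low 64 bits of the product, built only from 32-bit operations
// plus one 32x32 -> 64 multiply, the way a compiler lowers it on a 32-bit CPU.
uint64_t mul64(uint64_t a, uint64_t b) {
    uint32_t a_lo = static_cast<uint32_t>(a), a_hi = static_cast<uint32_t>(a >> 32);
    uint32_t b_lo = static_cast<uint32_t>(b), b_hi = static_cast<uint32_t>(b >> 32);

    // Low halves: a full 32x32 -> 64 multiply (one mul instruction).
    uint64_t low = static_cast<uint64_t>(a_lo) * b_lo;

    // High 32 bits: carry-out of the low product plus the two cross terms.
    // Unsigned multiplication wraps mod 2^32, which discards exactly the bits
    // that would land above bit 63 of the final result.
    uint32_t high = static_cast<uint32_t>(low >> 32)
                  + a_lo * b_hi
                  + a_hi * b_lo;

    return (static_cast<uint64_t>(high) << 32) | static_cast<uint32_t>(low);
}
```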
This gives about 5 instructions; however, because of the relatively limited number of registers in x86 (ignoring extensions to the architecture), they cannot take too much advantage of pipelining. Enabling SSE may improve the speed of multiplication, as it gives the compiler more registers to work with.
I don't know exactly how 64-bit division works on a 32-bit CPU, but it's much more complex than addition, subtraction or even multiplication. It's likely to be about ten times slower than division on a 64-bit CPU, however. Check "The Art of Computer Programming, Volume 2: Seminumerical Algorithms", page 257 for more details, if you can understand it (I cannot, in a way that I could explain it, unfortunately).
If you divide by a power of 2, please refer to the shifting section, because that's essentially what a compiler can optimize the division into (for signed numbers, it additionally adds divisor − 1 to negative values before shifting, so the result rounds toward zero).
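A sketch of what the compiler emits for a signed division by a power of two (here for a fixed divisor of 8; the function name is mine):

```cpp
#include <cstdint>

// Equivalent of x / 8 for signed x, without a division instruction.
// An arithmetic right shift alone would round toward negative infinity,
// so negative values are first biased by (divisor - 1) = 7.
// (Right-shifting negative values is arithmetic on all mainstream
// compilers, though the C++ standard only guaranteed it from C++20.)
int32_t div_by_8(int32_t x) {
    if (x < 0)
        x += 7;     // compilers do this bias branchlessly using shifts
    return x >> 3;  // arithmetic shift
}
```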
Considering these operations work bit by bit, nothing special happens here: the bitwise operation is simply done twice, once on each half.
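In terms of the two-halves representation (using the same hypothetical `u64parts` struct as above, with xor as the example):

```cpp
#include <cstdint>

struct u64parts {
    uint32_t lo;
    uint32_t hi;
};

// Each bit of the result depends only on the same bit of the inputs,
// so the two halves can be combined completely independently.
u64parts xor64(u64parts a, u64parts b) {
    return { a.lo ^ b.lo, a.hi ^ b.hi };
}
```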
Interestingly, x86 actually has an instruction to help with 64-bit left shifts, called `shld`, which, instead of filling the vacated least significant bits of a value with zeros, fills them with the most significant bits of a different register. Similarly, `shrd` does the same for right shifts. These easily make a 64-bit shift a two-instruction operation.
However, that's only the case for constant shifts. When the shift count is not constant, things get trickier, as the x86 architecture only supports shift counts of 0-31. Anything beyond that is, according to the official documentation, undefined; in practice, the count is masked with 0x1F. Therefore, when the shift count is 32 or higher, one of the two 32-bit halves is zeroed entirely (for a left shift, that's the lower half; for a right shift, the higher half), the other half takes the value that was in the zeroed one, and the shift is then performed with the count reduced by 32. As a result, this depends on the branch predictor making good predictions, and it's a bit slower because the count needs to be checked.
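That branchy lowering can be sketched in C++ for a variable left shift (helper names are mine; as in C, counts of 64 or more are not handled):

```cpp
#include <cstdint>

struct u64parts {
    uint32_t lo;
    uint32_t hi;
};

// Variable 64-bit left shift on 32-bit halves. For 0 < n < 32 this matches
// the shld + shl pair; for n >= 32 the low half is zeroed and its old
// contents, shifted by n - 32, become the high half. n must be in 0..63.
u64parts shl64(u64parts v, unsigned n) {
    u64parts r;
    if (n == 0) {
        r = v;  // special-cased: lo >> 32 below would be undefined
    } else if (n < 32) {
        r.hi = (v.hi << n) | (v.lo >> (32 - n));  // like shld
        r.lo = v.lo << n;                         // like shl
    } else {
        r.hi = v.lo << (n - 32);  // the low half migrates upward
        r.lo = 0;                 // and is replaced with zeros
    }
    return r;
}
```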
Population count is simply done on both halves and summed: `__builtin_popcount(lower) + __builtin_popcount(higher)`
I'm too lazy to finish the answer at this point. Does anyone even use those?
Addition, subtraction, multiplication, or, and, xor, and shift left generate exactly the same code for signed and unsigned types. Shift right uses only slightly different code (arithmetic shift vs logical shift), but structurally it's the same. Division does generate different code, however, and signed division is likely to be slower than unsigned division.
Benchmarks? They are mostly meaningless, as instruction pipelining usually leads to things being faster when you don't constantly repeat the same operation. Feel free to consider division slow, but nothing else really is, and once you get outside of benchmarks, you may notice that, because of pipelining, doing 64-bit operations on a 32-bit CPU is not slow at all.
Benchmark your own application, don't trust micro-benchmarks that don't do what your application does. Modern CPUs are quite tricky, so unrelated benchmarks can and will lie.