问题
I want to calculate a sequence of exponents of base 2. for example if given 3 the program would calculate (2^3 + 2^2 + 2^1 + 2^0). The problem is I cant calculate the exponent to being with.
I have tried shl to calculate the exponent and I have tried a loop. Neither produce the correct result, my latest attempt is below.
.586
.MODEL FLAT
INCLUDE io.h
.STACK 4096
.DATA
Exponent DWORD ?
Prompt BYTE "Enter an exponent", 0
Base DWORD 2
string BYTE 40 DUP (?)
resultLbl BYTE "The solution is", 0
Solution DWORD 20 DUP (?), 0
.CODE
_MainProc PROC
input Prompt, string, 40 ; read ASCII characters (N)
atod string ; ascii to double
mov exponent, eax ; store in memory
push Exponent
push Base
call function
add esp, 8
dtoa Solution, eax ; convert to ASCII characters
output resultLbl, Solution ; output label and sum
mov eax, 0 ; exit with return code 0
ret
_MainProc ENDP
;@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
; Procedure takes base and exponent from main
Function PROC
push ebp ; stores base pointer
mov ebp, esp ; set base pointer to current position
mov ebx, [ebp + 12] ; read base (2)
mov ecx, [ebp + 8] ; read N
MainLoop: ; loop to calculate expoent
mul ebx ; multiply base by its self
add eax, ebx ; save answer in eax
dec ecx ; decrement the exponent
cmp ecx, 0 ; check if exponent is 0 yet
je exit ; Exit if exponent is 0
jg mainloop ; stay in loop if exponent is above 0
Exit:
pop ebp ; restore base pointer
ret
function ENDP
END
I want the program to produce the correct exponent result, and if you can calculate the sequence above, for example if given 3 the program would calculate (2^3 + 2^2 + 2^1 + 2^0).
回答1:
Base 2 is an extremely special case because computers use binary integers.
2^n = 1 << n i.e. 2n = 1 left-shifted n times. i.e. that bit-position is set.
You're asking for 2^0 + 2^1 + ... 2^n. A number with all bits set up to an including bit n.
That's 1 less than 2^(n+1), so the number you want is 2^(n+1) - 1.
x86 has an instruction to set a bit at a given position in a register: bts reg, reg = Bit Test and Set. On Intel CPUs, the reg-reg form decodes to a single uop, with 1-cycle latency. (https://agner.org/optimize/) It also puts the old value of the bit in CF, but normally we don't care about that when using it for this. Normally we just zero a destination register and use BTS to set a single bit.
On Ryzen, bts reg,reg is 2 uops and variable-count shifts are single-uop, so mov edx, 1 / shl edx, cl is good, too. But that's worse for code-size, and more uops on Intel CPUs.
On Intel Sandybridge-family CPUs, shl reg, cl decodes to 3 uops. BMI2 shlx reg, reg, reg is 1 uop, but BMI2 support is still far from being something you can assume, unfortunately.
(Memory-destination BTS with a register source is very slow, treating the source as an indexing into an arbitrary length bit-string, not just the addressed dword, so normally avoid it for performance.)
Calculating 2^(n+1) - 1 without overflow
(C version of this answer on another question, with compiler output.)
If we just added 1 to the input number before feeding it to BTS, it could wrap around and fail to work for 31 (which should set all bits, but would instead set bit 32%32 = 0)
Since we need one extra instruction after reading the input anyway, we can use x86's shift-and-add instruction (LEA) to do one more shift as we subtract 1. So with n=31 we start with the high bit set, and shift it out to get zero. Subtracting 1 then sets all the bits, as desired
xor edx,edx ; edx = 0
bts edx, eax ; edx |= 1<<eax
lea edx, [edx + edx - 1] ; edx = (edx<<1) - 1
The logic for this goes as follows
n BTS result (2^n) *2 - 1
0 -> 1 -> 1 = 2 - 1
1 -> 2 -> 3 = 4 - 1
2 -> 4 -> 7 = 8 - 1
3 -> 8 -> 15 = 16 - 1
...
30 -> 0x40000000 -> 0x7FFFFFFF = 0x80000000 - 1
31 -> 0x80000000 -> 0xFFFFFFFF = 0 - 1
It's not a coincidence that the last column is a running total of the 2nd column.
LEA with 3 "components" to the addressing mode has extra latency vs. simpler LEAs, e.g. 3-cycle latency on Sandybridge-family vs. 1 cycle. But it's still a single uop so it's a great choice for throughput.
If we really wanted to optimize, and didn't have to worry about the n=31 case overflowing, we'd write the ASCII -> int loop by hand but start with a total of 1 instead of 0, to fold the n+1 into finding n. Then bts would give us our 2^(n+1), and we could simply dec that.
There's no need to store exponent to memory and reload it again; you already have it in a register where you want it.
Your comment on the atod line is wrong, and your code wouldn't make sense if it was right. ASCII to double (like the C function strtod) would return in x87 st0 or SSE2 xmm0, not EAX. atod actually stands for ASCII (decimal) to integer. Perhaps the d stands for DWORD.
input Prompt, string, 40 ; read ASCII characters (N)
atod string ; ascii (decimal) to integer in EAX
xor edx,edx ; edx = 0
bts edx, eax ; edx |= 1<<eax
lea edx, [edx + edx - 1] ; edx = (edx<<1) - 1
dtoa Solution, edx
It takes 2 instructions to implement 1<<n so it's silly to put it into its own function. Just passing args + calling the function would have take as much work as simply using bts.
0 -> 1 -> 1
1 -> 2 -> 3
2 -> 4 -> 7
Compilers typically use mov edx,1 + mov ecx, eax + shl edx, cl. But that's a missed optimization, especially on Intel CPUs.
BMI2 shlx edx, edx, ecx would help (avoiding the mov), but is worse code-size than xor-zero + bts. And BMI2 isn't always available.
If you can't get your compiler to use BTS (or you have BMI2 for SHLX), another good implementation is (2ULL << n) - 1 so you bake the extra shift into the number you start shifting. So a count of 31 can shift the bit out and produce 0.
来源:https://stackoverflow.com/questions/56509937/how-to-calculate-the-sum-of-a-sequence-of-powers-of-2-in-x86