Efficient Assembly multiplication

Submitted on 2019-12-24 05:20:32

Question


I started practicing assembly not too long ago. I want to implement efficient multiplication using the assembly instructions lea and shift. I want to write a C program that calls an assembly procedure that fits a constant argument received from the user, and multiplies another argument received from the user by that constant.

How can I make this code efficient?
Which numbers (if any) can I group into the same procedure? For example, I think I can group 2, 4, 8, ... into the same procedure, since they are just left shifts by 1, 2, 3, ...

But I'm having trouble finding other groups like this one for other numbers, and what about negatives?


Answer 1:


The interesting part of this exercise is finding ways to use 1 or 2 LEA, SHL, and/or ADD/SUB instructions to implement multiplies by various constants.

Actually dispatching on the fly for a single multiply isn't very interesting, and would mean either actual JIT compiling or that you have every possible sequence already present in a giant table of tiny blocks of code. (Like switch statements.)

Instead I'd suggest writing a C or Python or whatever function that takes 1 integer arg, and as output produces the asm source text that implements x * n where n is the integer arg. i.e. a function like you might find in a compiler that optimizes a multiply-by-constant.
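
A minimal skeleton of such a generator might look like this (the function name emit_mul and the register conventions are illustrative, not from the question; only the trivial cases and the imul fallback are filled in, the interesting patterns come later):

// Sketch of a multiply-by-constant "compiler": print x86-64 asm text that
// leaves EDI * n in EAX.  Everything beyond powers of 2 falls back to imul
// until the patterns discussed below are added.
#include <stdio.h>

void emit_mul(unsigned n) {
    if (n == 0) {
        puts("        xor     eax, eax");                  // x*0 = 0
    } else if (n == 1) {
        puts("        mov     eax, edi");                  // x*1 = x
    } else if ((n & (n - 1)) == 0) {                       // single set bit: power of 2
        puts("        mov     eax, edi");
        printf("        sal     eax, %d\n", __builtin_ctz(n));
    } else {
        // TODO: 2-set-bit, 2^k +- 1, and LEA patterns discussed below
        printf("        imul    eax, edi, %u\n", n);       // fallback: hardware multiply
    }
}

As it stands, emit_mul(10) would just emit the imul fallback; the rest of this answer is about filling in that TODO.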

You might want to cook up an automated way to test this, e.g. by comparing against a pure C x * n for a couple different x values.
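
One way to do that (a sketch; mul_by_const is a placeholder name for the assembled procedure under test, built from the generated asm for some constant N):

// Test driver sketch: compare the assembled procedure against pure C x*N.
#include <stdio.h>

extern int mul_by_const(int x);   // hypothetical name for the asm procedure
#define N 15                      // the constant the asm was generated for

int main(void) {
    int tests[] = {0, 1, -1, 2, 7, 1000, -12345};
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        int x = tests[i];
        int want = x * N;                       // pure C reference
        int got  = mul_by_const(x);
        printf("x=%d: got %d, want %d%s\n", x, got, want,
               got == want ? "" : "   <-- MISMATCH");
    }
    return 0;
}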


If you can't get the job done in 2 instructions (or 3 with one of them being mov), it's not worth it. Modern x86 has ridiculously efficient multiply in hardware. imul reg, r/m, imm is 1 uop, 3 cycle latency, fully pipelined. (AMD since Zen, Intel since Core2 or Nehalem or so.) That's your fallback for anything that you can't get done with a critical path length of 1 or 2 cycles (assuming zero-latency mov if you want, like IvyBridge+ and Zen.)

Or you could set a higher threshold before fallback if you want to explore more complicated sequences, e.g. aim for 64-bit multiply on Bulldozer-family (6 cycle latency). https://agner.org/optimize/. Or even P5 Pentium where imul takes 9 cycles (not pairable).


Patterns to look for

Integer multiply boils down to adding up shifted copies of 1 operand where the other operand has 1 bits. (See the algorithm for implementing multiply by runtime-variable values, by shift and add checking each bit one at a time.)
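
For reference, that runtime-variable algorithm written out in C (this is what the constant-multiply sequences are shortcutting):

// Shift-and-add multiply: for each set bit of n, add the correspondingly
// shifted copy of x.
unsigned shift_add_mul(unsigned x, unsigned n) {
    unsigned result = 0;
    while (n) {
        if (n & 1)
            result += x;   // this bit of n contributes x << (bit position)
        x <<= 1;           // shifted copy for the next bit
        n >>= 1;
    }
    return result;
}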

The easiest pattern is of course only a single set bit, i.e. a power of 2; then it's just a left shift. This is easy to check for: n & (n-1) == 0, when n != 0.

Anything with exactly 2 set bits is at most 2 shifts and an add. (GNU C __builtin_popcount(n) counts set bits. In x86 asm, SSE4.2 popcnt).

GNU C __builtin_ctz finds the bit-index of the lowest set bit. Using it on a number you know is non-zero will give you the shift count for the low bit. In x86 asm, bsf / tzcnt.

To clear that lowest set bit and "expose" the next-lowest, you can do n &= n-1;. In x86 asm, BMI1 blsr or LEA / AND.
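
Put together, the classification helpers for the generator might look like this (a sketch using the GNU C builtins mentioned above; the function names are made up):

// Bit-twiddling checks from the paragraphs above (GNU C builtins).
#include <stdbool.h>

bool is_pow2(unsigned n) {              // exactly one set bit
    return n != 0 && (n & (n - 1)) == 0;
}

int set_bits(unsigned n) {              // how many shifted copies get added
    return __builtin_popcount(n);
}

int low_bit_index(unsigned n) {         // shift count for the lowest set bit
    return __builtin_ctz(n);            // n must be non-zero
}

unsigned clear_low_bit(unsigned n) {    // "expose" the next-lowest set bit
    return n & (n - 1);                 // BMI1 blsr in asm
}

For a constant with exactly 2 set bits, low_bit_index(n) gives the first shift count and low_bit_index(clear_low_bit(n)) gives the second.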


Another interesting pattern to look for is 2^n +- 1. The +1 case is already covered by the 2-set-bits case, but the shift count for the low bit is 0; no shift needed. With shift counts up to 3, you can do it in one LEA.

You can detect 2^n - 1 by checking if n+1 is a power of 2 (has only 1 bit set). Somewhat more complex, (2^n - 1) * 2^m can be done with this trick plus another shift. So you could try right-shifting to bring the lowest set bit to the bottom then looking for tricks.
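
A sketch of that detection (the helper name is made up): strip the trailing zeros first, then check that what's left is a contiguous run of ones.

// Detect c == (2^k - 1) << m, i.e. a run of k ones shifted left by m.
// If it matches, x*c can be computed as ((x << k) - x) << m.
int match_ones_run(unsigned c, int *k, int *m) {
    if (c == 0)
        return 0;
    *m = __builtin_ctz(c);              // trailing zeros = the final shift
    unsigned ones = c >> *m;            // bring the lowest set bit to the bottom
    if ((ones & (ones + 1)) != 0)       // ones+1 must be a power of 2
        return 0;
    *k = __builtin_popcount(ones);      // length of the run of ones
    return 1;
}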

GCC does this the 2^n - 1 way:

mul15:              # gcc -O3 -mtune=bdver2
        mov     eax, edi
        sal     eax, 4
        sub     eax, edi
        ret

clang is more efficient (for Intel CPUs where scaled-index is still only 1 cycle latency):

mul15:             # clang -O3 -mtune=bdver2
        lea     eax, [rdi + 4*rdi]
        lea     eax, [rax + 2*rax]
        ret

Combining these patterns

Maybe factorize your number into its prime factors and look for ways to use your building blocks to do combinations of those factors.

But this isn't the only approach. You can do x*11 as x*5*2 + x, which is what GCC and Clang do (and is a lot like How to multiply a register by 37 using only 2 consecutive leal instructions in x86?):

        lea     eax, [rdi + 4*rdi]
        lea     eax, [rdi + 2*rax]

There are 2 approaches for x*17 as well. GCC and Clang do it this way:

mul17:
        mov     eax, edi
        sal     eax, 4
        add     eax, edi
        ret

But another way, which they fail to use even with -march=sandybridge (no mov-elimination, but 1-cycle LEA [reg + reg*scale]), is:

mul17:
        lea    eax,  [rdi + 8*rdi]  ; x*9
        lea    eax,  [rax + 8*rdi]  ; x*9 + x*8 = x*17

So instead of multiplying factors, we're adding different multipliers to make the total multiplier.


I don't have any great suggestions for how to programmatically search for these sequences beyond the simple ones like 2 set bits, or 2^n +- 1. If you're curious, have a look in GCC or LLVM source code for the functions that do these optimizations; they find a lot of tricky ones.
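
In the absence of anything smarter, brute force over the tiny space of 1- and 2-LEA sequences is workable. A rough sketch, under the simplifying assumption that each step computes either v*s or v + w*s with s in {1,2,4,8} (ADD/SUB/SHL and mov are deliberately ignored):

// Which multipliers are reachable with at most two LEA-like steps?
// Step 1 from x: t = x*s or t = x + x*s.  Step 2 may reuse t and/or x.
#include <stdio.h>

int main(void) {
    static const int scales[] = {1, 2, 4, 8};
    int one_lea[8], n1 = 0;
    for (int i = 0; i < 4; i++) {
        one_lea[n1++] = scales[i];           // lea eax, [s*rdi]      -> x*s
        one_lea[n1++] = 1 + scales[i];       // lea eax, [rdi+s*rdi]  -> x*(1+s)
    }
    for (int target = 2; target <= 100; target++) {
        int found = 0;
        for (int i = 0; i < n1 && !found; i++) {
            int a = one_lea[i];              // multiplier after the first LEA
            for (int j = 0; j < 4 && !found; j++) {
                int s = scales[j];
                if (a * s == target          // lea eax, [s*rax]
                    || a * (1 + s) == target // lea eax, [rax+s*rax]
                    || a + s == target       // lea eax, [rax+s*rdi]
                    || 1 + a * s == target)  // lea eax, [rdi+s*rax]
                    found = 1;
            }
        }
        if (found)
            printf("x*%d: doable with at most 2 LEAs\n", target);
    }
    return 0;
}

This finds, for example, 17 as 9 + 8 and 37 as 1 + 9*4 (the trick from the linked question), but because it ignores shifts, adds, and 3-instruction sequences, it's only a starting point rather than an exhaustive search.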

The work might be split between target-neutral optimization passes for powers of 2 vs. x86-specific target code for using LEA, and for deciding on a threshold of how many instructions is worth it before falling back to imul-immediate.


Negative numbers

x * -8 could be done with x - x*9. I think that might be safe even if x*9 overflows, but you'd have to double-check.
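
A quick way to double-check is to reproduce the registers' wrap-around arithmetic with unsigned types (signed overflow is UB in C itself, but 32-bit registers just wrap mod 2^32):

// Check the identity behind x*-8 == x - x*9 under wrapping arithmetic.
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t tests[] = {0, 1, 2, 0x12345678u, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu};
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        uint32_t x = tests[i];
        uint32_t a = x - x * 9u;            // the x - x*9 sequence
        uint32_t b = x * (uint32_t)-8;      // reference: x * -8, wrapped
        printf("x=%#x: %s\n", x, a == b ? "ok" : "MISMATCH");
    }
    return 0;
}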


Look at compiler output

#define MULFUN(c) int mul##c(int x) { return x*c; }
MULFUN(9)
MULFUN(10)
MULFUN(11)
MULFUN(12)
...

I put that on the Godbolt compiler explorer for the x86-64 System V ABI (first arg in RDI, like the above examples), with gcc and clang -O3. I used -mtune=bdver2 (Piledriver) because it has somewhat slower multiply than Intel or Zen; this encourages GCC and Clang to avoid imul slightly more aggressively.

I didn't check whether long / uint64_t would change that (6-cycle instead of 4-cycle latency, and half the throughput), or whether an older uarch like -mtune=nocona (Pentium 4) would make a difference. -mtune=bdver2 did make a difference vs. the default tune=generic for GCC, at least.

If you use -m32, you can use even older uarches like -mtune=pentium (in-order P5). I'd recommend -mregparm=3 for that so args are still passed in registers, not the stack.



Source: https://stackoverflow.com/questions/59227227/efficient-assembly-multiplication
