Efficiently dividing unsigned value by a power of two, rounding up

Posted by 我的未来我决定 (submitted on 2019-11-28 21:12:09)

This answer is about what's ideal in asm; what we'd like to convince the compiler to emit for us. (I'm not suggesting actually using inline asm, except as a point of comparison when benchmarking compiler output. https://gcc.gnu.org/wiki/DontUseInlineAsm).

I did manage to get pretty good asm output from pure C for ceil_div_andmask, see below. (It's worse than a CMOV on Broadwell/Skylake, but probably good on Haswell. Still, the user23/chux version looks even better for both cases.) It's mostly just worth mentioning as one of the few cases where I got the compiler to emit the asm I wanted.


It looks like Chris Dodd's general idea of return ((p-1) >> lg(q)) + 1 with special-case handling for d=0 is one of the best options. I.e. the optimal implementation of it in asm is hard to beat with an optimal implementation of anything else. Chux's (p >> lg(q)) + (bool)(p & (q-1)) also has advantages (like lower latency from p->result), and more CSE when the same q is used for multiple divisions. See below for a latency/throughput analysis of what gcc does with it.

If the same e = lg(q) is reused for multiple dividends, or the same dividend is reused for multiple divisors, different implementations can CSE more of the expression. They can also effectively vectorize with AVX2.

Branches are cheap and very efficient if they predict very well, so branching on d==0 will be best if it's almost never taken. If d==0 is not rare, branchless asm will perform better on average. Ideally we can write something in C that will let gcc make the right choice during profile-guided optimization, and compiles to good asm for either case.

Since the best branchless asm implementations don't add much latency vs. a branchy implementation, branchless is probably the way to go unless the branch would go the same way maybe 99% of the time. This might be likely for branching on p==0, but probably less likely for branching on p & (q-1).


It's hard to guide gcc5.4 into emitting anything optimal. This is my work-in-progress on Godbolt.

I think the optimal sequences for Skylake for this algorithm are as follows. (Shown as stand-alone functions for the AMD64 SysV ABI, but talking about throughput/latency on the assumption that the compiler will emit something similar inlined into a loop, with no RET attached).


Branch on carry from calculating d-1 to detect d==0, instead of a separate test & branch. Reduces the uop count nicely, esp on SnB-family where JC can macro-fuse with SUB.

ceil_div_pjc_branch:
 xor    eax,eax          ; can take this uop off the fast path by adding a separate xor-and-return block, but in reality we want to inline something like this.
 sub    rdi, 1
 jc    .d_was_zero       ; fuses with the sub on SnB-family
 tzcnt  rax, rsi         ; tzcnt rsi,rsi also avoids any false-dep problems, but this illustrates that the q input can be read-only.
 shrx   rax, rdi, rax
 inc    rax
.d_was_zero:
 ret
  • Fused-domain uops: 5 (not counting ret), and one of them is an xor-zero (no execution unit)
  • HSW/SKL latency with successful branch prediction:
    • (d==0): No data dependency on d or q, breaks the dep chain. (control dependency on d to detect mispredicts and retire the branch).
    • (d!=0): q->result: tzcnt+shrx+inc = 5c
    • (d!=0): d->result: sub+shrx+inc = 3c
  • Throughput: probably just bottlenecked on uop throughput

I've tried but failed to get gcc to branch on CF from the subtract; it always wants to do a separate comparison. I know gcc can be coaxed into branching on CF after subtracting two variables, but maybe this fails if one is a compile-time constant. (IIRC, this typically compiles to a CF test with unsigned vars: diff = foo - bar; if(foo < bar) carry_detected = 1;)
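A minimal sketch of that idiom (my example, not from the original answer; gcc and clang can compile the comparison into a check of the CF that the SUB already set):

```c
#include <stdint.h>

// Borrow detection idiom: the borrow out of a - b happens exactly when
// a < b, which compilers can fold into the flags set by the SUB itself
// (SUB + SETC, or SUB + JC when used to branch).
uint64_t sub_and_detect_borrow(uint64_t a, uint64_t b, int *borrow) {
    *borrow = (a < b);  // same condition the SUB already computed in CF
    return a - b;
}
```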


Branchless with ADC / SBB to handle the d==0 case. Zero-handling adds only one instruction to the critical path (vs. a version with no special handling for d==0), but also converts one other from an INC to a sbb rax, -1 to make CF undo the -= -1. Using a CMOV is cheaper on pre-Broadwell, but takes extra instructions to set it up.

ceil_div_pjc_asm_adc:
 tzcnt  rsi, rsi
 sub    rdi, 1
 adc    rdi, 0          ; d? d-1 : d.  Sets CF=CF
 shrx   rax, rdi, rsi
 sbb    rax, -1         ; result++ if d was non-zero
 ret    
  • Fused-domain uops: 5 (not counting ret) on SKL. 7 on HSW
  • SKL latency:
    • q->result: tzcnt+shrx+sbb = 5c
    • d->result: sub+adc+shrx(dep on q begins here)+sbb = 4c
  • Throughput: TZCNT runs on p1. SBB, ADC, and SHRX only run on p06. So I think we bottleneck on 3 uops for p06 per iteration, making this run at best one iteration per 1.5c.

If q and d become ready at the same time, note that this version can run SUB/ADC in parallel with the 3c latency of TZCNT. If both are coming from the same cache-miss cache line, it's certainly possible. In any case, introducing the dep on q as late as possible in the d->result dependency chain is an advantage.

Getting this from C seems unlikely with gcc5.4. There is an intrinsic for add-with-carry, but gcc makes a total mess of it. It doesn't use immediate operands for ADC or SBB, and stores the carry into an integer reg between every operation. gcc7, clang3.9, and icc17 all make terrible code from this.

#include <x86intrin.h>
// compiles to completely horrible code, putting the flags into integer regs between ops.
T ceil_div_adc(T d, T q) {
  T e = lg(q);
  unsigned long long dm1;  // unsigned __int64
  unsigned char CF = _addcarry_u64(0, d, -1, &dm1);
  CF = _addcarry_u64(CF, 0, dm1, &dm1);
  T shifted = dm1 >> e;
  _subborrow_u64(CF, shifted, -1, &dm1);
  return dm1;
}
    # gcc5.4 -O3 -march=haswell
    mov     rax, -1
    tzcnt   rsi, rsi
    add     rdi, rax
    setc    cl
    xor     edx, edx
    add     cl, -1
    adc     rdi, rdx
    setc    dl
    shrx    rdi, rdi, rsi
    add     dl, -1
    sbb     rax, rdi
    ret

CMOV to fix the whole result: worse latency from q->result, since it's used sooner in the d->result dep chain.

ceil_div_pjc_asm_cmov:
 tzcnt  rsi, rsi
 sub    rdi, 1
 shrx   rax, rdi, rsi
 lea    rax, [rax+1]     ; inc preserving flags
 cmovc  rax, zeroed_register
 ret    
  • Fused-domain uops: 5 (not counting ret) on SKL. 6 on HSW
  • SKL latency:
    • q->result: tzcnt+shrx+lea+cmov = 6c (worse than ADC/SBB by 1c)
    • d->result: sub+shrx(dep on q begins here)+lea+cmov = 4c
  • Throughput: TZCNT runs on p1. LEA is p15. CMOV and SHRX are p06. SUB is p0156. In theory only bottlenecked on fused-domain uop throughput, so one iteration per 1.25c. With lots of independent operations, resource conflicts from SUB or LEA stealing p1 or p06 shouldn't be a throughput problem because at 1 iter per 1.25c, no port is saturated with uops that can only run on that port.

CMOV to get an operand for SUB: I was hoping I could find a way to create an operand for a later instruction that would produce a zero when needed, without an input dependency on q, e, or the SHRX result. This would help if d is ready before q, or at the same time.

This doesn't achieve that goal, and needs an extra 7-byte mov rdx,-1 in the loop.

ceil_div_pjc_asm_cmov:
 tzcnt  rsi, rsi
 mov    rdx, -1
 sub    rdi, 1
 shrx   rax, rdi, rsi
 cmovnc rdx, rax
 sub    rax, rdx       ; res += d ? 1 : -res
 ret    

Lower-latency version for pre-BDW CPUs with expensive CMOV, using SETCC to create a mask for AND.

ceil_div_pjc_asm_setcc:
 xor    edx, edx        ; needed every iteration

 tzcnt  rsi, rsi
 sub    rdi, 1

 setc   dl              ; d!=0 ?  0 : 1
 dec    rdx             ; d!=0 ? -1 : 0   // AND-mask

 shrx   rax, rdi, rsi
 inc    rax
 and    rax, rdx        ; zero the bogus result if d was initially 0
 ret    

Still 4c latency from d->result (and 6 from q->result), because the SETC/DEC happen in parallel with the SHRX/INC. Total uop count: 8. Most of these insns can run on any port, so it should be 1 iter per 2 clocks.

Of course, for pre-HSW, you also need to replace SHRX.

We can get gcc5.4 to emit something nearly as good: (still uses a separate TEST instead of setting mask based on sub rdi, 1, but otherwise the same instructions as above). See it on Godbolt.

T ceil_div_andmask(T p, T q) {
    T mask = -(T)(p!=0);  // TEST+SETCC+NEG
    T e = lg(q);
    T nonzero_result = ((p-1) >> e) + 1;
    return nonzero_result & mask;
}

When the compiler knows that p is non-zero, it takes advantage and makes nice code:

// http://stackoverflow.com/questions/40447195/can-i-hint-the-optimizer-by-giving-the-range-of-an-integer
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#define assume(x) do{if(!(x)) __builtin_unreachable();}while(0)
#else
#define assume(x) (void)(x) // still evaluate it once, for side effects in case anyone is insane enough to put any inside an assume()
#endif

T ceil_div_andmask_nonzerop(T p, T q) {
  assume(p!=0);
  return ceil_div_andmask(p, q);
}
    # gcc5.4 -O3 -march=haswell
    xor     eax, eax             # gcc7 does tzcnt in-place instead of wasting an insn on this
    sub     rdi, 1
    tzcnt   rax, rsi
    shrx    rax, rdi, rax
    add     rax, 1
    ret

Chux / user23_variant

Only 3c latency from p->result, and a constant q can CSE a lot.

T divide_A_chux(T p, T q) {
  bool round_up = p & (q-1);  // compiles differently from user23_variant with clang: AND instead of 
  return (p >> lg(q)) + round_up;
}

    xor     eax, eax           # in-place tzcnt would save this
    xor     edx, edx           # target for setcc
    tzcnt   rax, rsi
    sub     rsi, 1
    test    rsi, rdi
    shrx    rdi, rdi, rax
    setne   dl
    lea     rax, [rdx+rdi]
    ret

Doing the SETCC before TZCNT would allow an in-place TZCNT, saving the xor eax,eax. I haven't looked at how this inlines in a loop.

  • Fused-domain uops: 8 (not counting ret) on HSW/SKL
  • HSW/SKL latency:
    • q->result: (tzcnt+shrx(p) | sub+test(p)+setne) + lea(or add) = 5c
    • d->result: test(dep on q begins here)+setne+lea = 3c. (the shrx->lea chain is shorter, and thus not the critical path)
  • Throughput: Probably just bottlenecked on the frontend, at one iter per 2c. Saving the xor eax,eax should speed this up to one per 1.75c (but of course any loop overhead will be part of the bottleneck, because frontend bottlenecks are like that).
uint64_t exponent = lg(q);
uint64_t mask = q - 1;
//     v divide
return (p >> exponent) + (((p & mask) + mask) >> exponent);
//                       ^ round up

The separate computation of the "round up" part avoids the overflow issues of (p+q-1) >> lg(q). Depending on how smart your compiler is, it might be possible to express the "round up" part as ((p & mask) != 0) without branching.

Chris Dodd

The efficient way of dividing by a power of 2 for an unsigned integer in C is a right shift -- shifting right by one divides by two (rounding down), so shifting right by n divides by 2^n (rounding down).

Now you want to round up rather than down, which you can do by first adding 2^n - 1, or equivalently subtracting one before the shift and adding one after (except for 0). This works out to something like:

unsigned ceil_div(unsigned d, unsigned e) {
    /* compute ceil(d/2**e) */
    return d ? ((d-1) >> e) + 1 : 0;
}

The conditional can be removed by using the boolean value of d for addition and subtraction of one:

unsigned ceil_div(unsigned d, unsigned e) {
    /* compute ceil(d/2**e) */
    return ((d - !!d) >> e) + !!d;
}

Due to its size and the speed requirement, the function should be made static inline. It probably won't make a difference to the optimizer, but the parameters should be const. If it must be shared among many files, define it in a header:

static inline unsigned ceil_div(const unsigned d, const unsigned e){...
chux


[Re-write] given OP's clarification concerning power-of-2.

The round-up or ceiling part is easy when overflow is not a concern. Simple add q-1, then shift.

Otherwise as the possibility of rounding depends on all the bits of p smaller than q, detection of those bits is needed first before they are shifted out.

uint64_t divide_A(uint64_t p, uint64_t q) {
  bool round_up = p & (q-1);
  return (p >> lg64(q)) + round_up;
} 

This assumes code has an efficient lg64(uint64_t x) function, which returns the base-2 log of x if x is a power-of-two.
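One possible lg64, sketched here under the stated power-of-two assumption (the implementation via __builtin_ctzll is my suggestion, not from the answer):

```c
#include <stdint.h>

// Possible lg64 for a power-of-two x: for such x, the trailing-zero count
// equals the base-2 logarithm exactly. Undefined for x == 0, like the builtin.
static inline uint64_t lg64(uint64_t x) {
    return (uint64_t)__builtin_ctzll(x);
}
```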

My old answer didn't work if p was one more than a power of two (whoops). So my new solution, using the __builtin_ctzll() and __builtin_ffsll() functions0 available in gcc (which as a bonus, provides the fast logarithm you mentioned!):

uint64_t divide(uint64_t p,uint64_t q) {
    int lp=__builtin_ffsll(p);
    int lq=__builtin_ctzll(q);
    return (p>>lq)+(lp<(lq+1)&&lp);
}

Note that this is assuming that a long long is 64 bits. It has to be tweaked a little otherwise.

The idea here is that we need to round up if and only if p has fewer trailing zeroes than q. Note that for a power of two, the number of trailing zeroes is equal to the logarithm, so we can use this builtin for the log as well.

The &&lp part is just for the corner case where p is zero: otherwise it will output 1 here.

0 Can't use __builtin_ctzll() for both because it is undefined if p==0.
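A quick sanity check of the corner cases called out above (zero dividend, and p one more than a power of two), with the function repeated so the snippet is self-contained:

```c
#include <stdint.h>

// divide() as given above: round up exactly when p has fewer trailing
// zeros than q (and p is nonzero).
static uint64_t divide(uint64_t p, uint64_t q) {
    int lp = __builtin_ffsll((long long)p);
    int lq = __builtin_ctzll(q);
    return (p >> lq) + (lp < (lq + 1) && lp);
}
```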

If the dividend/divisor can be guaranteed not to exceed 63 (or 31) bits, you can use the following version mentioned in the question. Note how p+q could overflow if they use all 64 bits. This would be fine if the SHR instruction shifted in the carry flag, but AFAIK it doesn't.

uint64_t divide(uint64_t p, uint64_t q) {
  return (p + q - 1) >> lg(q);
}

If those constraints cannot be guaranteed, you can just do the floor method and then add 1 if it would round up. This can be determined by checking whether any bits in the dividend are within the range of the divisor. Note: p & -p extracts the lowest set bit on 2's-complement machines (the BLSI instruction).

uint64_t divide(uint64_t p, uint64_t q) {
  return (p >> lg(q)) + ( (p & -p ) < q );
}

Which clang compiles to:

    bsrq    %rsi, %rax
    shrxq   %rax, %rdi, %rax
    blsiq   %rdi, %rcx
    cmpq    %rsi, %rcx
    adcq    $0, %rax
    retq

That's a bit wordy and uses some newer instructions, so maybe there is a way to use the carry flag in the original version. Let's see: the RCR instruction does, but it seems like it would be worse... perhaps the SHRD instruction... It would be something like this (unable to test at the moment):

    xor     edx, edx     ;edx = 0 (will store the carry flag)
    bsr     rcx, rsi     ;rcx = lg(q) ... could be moved anywhere before shrd
    lea     rax, [rsi-1] ;rax = q-1 (adding p could carry)
    add     rax, rdi     ;rax += p  (handle carry)
    setc    dl           ;rdx = carry flag ... or xor rdx and setc
    shrd    rax, rdx, cl ;rax = rdx:rax >> cl
    ret

1 more instruction, but should be compatible with older processors (if it works ... I'm always getting a source/destination swapped - feel free to edit)
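What that sequence computes can be sketched in C with a 128-bit intermediate (a hypothetical equivalent for checking the logic, not the codegen): the carry out of p + (q - 1) lands in bit 64, where the shift can still see it.

```c
#include <stdint.h>

// 65-bit-safe ceil-divide: the carry out of p + (q - 1) is preserved in
// the 128-bit intermediate, mirroring the rdx:rax pair that SHRD shifts.
uint64_t ceil_div_wide(uint64_t p, uint64_t q) {
    unsigned __int128 sum = (unsigned __int128)p + (q - 1);
    return (uint64_t)(sum >> __builtin_ctzll(q));
}
```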

Addendum:

I've implemented the lg() function I said was available as follows:

inline uint64_t lg(uint64_t x) {
  return 63U - (uint64_t)__builtin_clzll(x);
}

inline uint32_t lg32(uint32_t x) {
  return 31U - (uint32_t)__builtin_clz(x);
}

The fast log functions don't fully optimize to bsr on clang and ICC but you can do this:

#if defined(__x86_64__) && (defined(__clang__) || defined(__INTEL_COMPILER))
static inline uint64_t lg(uint64_t x){
  uint64_t ret;
  //other compilers may want bsrq here
  __asm__("bsr %1, %0" : "=r"(ret) : "r"(x));  // AT&T order: source, then destination
  return ret;
}
#endif

#if defined(__i386__) && (defined(__clang__) || defined(__INTEL_COMPILER))
static inline uint32_t lg32(uint32_t x){
  uint32_t ret;
  __asm__("bsr %1, %0" : "=r"(ret) : "r"(x));
  return ret;
}
#endif
BeeOnRope

There has already been a lot of human brainpower applied to this problem, with several variants of great answers in C along with Peter Cordes's answer which covers the best you could hope for in asm, with notes on trying to map it back to C.

So while the humans are having their kick at the can, I thought I'd see what some brute computing power has to say! To that end, I used Stanford's STOKE superoptimizer to try to find good solutions to the 32-bit and 64-bit versions of this problem.

Usually, a superoptimizer is something like a brute-force search through all possible instruction sequences until you find the best one by some metric. Of course, with something like 1,000 instructions that will quickly spiral out of control for more than a few instructions1. STOKE, on the other hand, takes a guided randomized approach: it randomly makes mutations to an existing candidate program, evaluating at each step a cost function that takes both performance and correctness into account. That's the one-liner anyway - there are plenty of papers if that stoked your curiosity.

So within a few minutes STOKE found some pretty interesting solutions. It found almost all the high-level ideas in the existing solutions, plus a few unique ones. For example, for the 32-bit function, STOKE found this version:

neg rsi                         
dec rdi                         
pext rax, rdi, rsi
inc eax
ret

It doesn't use any leading/trailing-zero count or shift instructions at all. Pretty much, it uses neg rsi to turn the divisor into a mask with 1s in the high bits, and then pext effectively does the shift of p-1 using that mask. Outside of that trick it's using the same trick that user QuestionC used: decrement p, shift, increment - but it happens to work even for a zero dividend because it uses 64-bit registers to distinguish the zero case from the MSB-set large-p case.

I added the C version of this algorithm to the benchmark, and added it to the results. It's competitive with the other good algorithms, tying for first in the "Variable Q" cases. It does vectorize, but not as well as the other 32-bit algorithms, because it needs 64-bit math and so the vectors can process only half as many elements at once.
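That pext trick might look like this in C (my reconstruction and naming; the `target("bmi2")` attribute is an assumption so the intrinsic compiles without a global -mbmi2 flag):

```c
#include <stdint.h>
#include <immintrin.h>  // _pext_u64, a BMI2 instruction

// Hypothetical C rendering of STOKE's pext solution: -q is a mask with 1s
// from bit lg(q) upward, so pext of (p-1) under that mask is (p-1) >> lg(q).
// Doing it in 64 bits makes p==0 wrap to the right answer after the 32-bit +1.
__attribute__((target("bmi2")))
uint32_t ceil_div_pext(uint32_t p, uint32_t q) {
    uint64_t mask = -(uint64_t)q;                         // neg rsi
    uint64_t shifted = _pext_u64((uint64_t)p - 1, mask);  // dec rdi; pext
    return (uint32_t)shifted + 1;                         // inc eax; p==0 wraps to 0
}
```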

Even better, in the 32-bit case it came up with a variety of solutions which use the fact that some of the intuitive solutions that fail for edge cases happen to "just work" if you use 64-bit ops for part of it. For example:

tzcntl ebp, esi      
dec esi             
add rdi, rsi        
sarx rax, rdi, rbp
ret

That's the equivalent of the return (p + q - 1) >> lg(q) suggestion I mentioned in the question. That doesn't work in general since for large p + q it overflows, but for 32-bit p and q this solution works great by doing the important parts in 64-bit. Convert that back into C with some casts and it actually figures out that using lea will do the addition in one instruction2:

stoke_32(unsigned int, unsigned int):
        tzcnt   edx, esi
        mov     edi, edi          ; goes away when inlining
        mov     esi, esi          ; goes away when inlining
        lea     rax, [rsi-1+rdi]
        shrx    rax, rax, rdx
        ret

So it's a 3-instruction solution when inlined into something that already has the values zero-extended into rdi and rsi. The stand-alone function definition needs the mov instructions to zero-extend because that's how the SysV x64 ABI works.
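The C version with the casts might look something like this (my reconstruction; __builtin_ctz stands in for lg, valid since q is a power of two):

```c
#include <stdint.h>

// 32-bit ceil-divide done with 64-bit math: (uint64_t)p + q - 1 cannot
// overflow 64 bits, so the simple add-then-shift form becomes safe.
uint32_t stoke_32(uint32_t p, uint32_t q) {
    uint64_t sum = (uint64_t)p + q - 1;          // the lea, after zero-extension
    return (uint32_t)(sum >> __builtin_ctz(q));  // shrx; tzcnt(q) == lg(q)
}
```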


For the 64-bit function it didn't come up with anything that blows away the existing solutions but it did come up with some neat stuff, like:

  tzcnt  r13, rsi      
  tzcnt  rcx, rdi      
  shrx   rax, rdi, r13 
  cmp    cl, r13b        
  adc    rax, 0        
  ret

That guy counts the trailing zeros of both arguments, and then adds 1 to the result if p has fewer trailing zeros than q, since that's when you need to round up. Clever!

In general, it understood the idea that you needed to shift right by the tzcnt really quickly (just like most humans) and then came up with a ton of other solutions to the problem of adjusting the result to account for rounding. It even managed to use blsi and bzhi in several solutions. Here's a 5-instruction solution it came up with:

tzcnt r13, rsi                  
shrx  rax, rdi, r13             
imul  rsi, rax                   
cmp   rsi, rdi                    
adc   rax, 0                    
ret 

It's basically a "multiply and verify" approach - take the truncated res = p / q, multiply it back, and if it's different from p, add one: return res * q == p ? res : res + 1. Cool. Not really better than Peter's solutions though. STOKE seems to have some flaws in its latency calculation - it thinks the above has a latency of 5 - but it's more like 8 or 9 depending on the architecture. So it sometimes narrows in on solutions based on its flawed latency calculation.


1 Interestingly enough, this brute force approach reaches the limit of its feasibility around 5-6 instructions: if you assume you can trim the instruction count to, say, 300 by eliminating SIMD and x87 instructions, then you would need ~28 days to try all 300^5 5-instruction sequences at 1,000,000 candidates/second. You could perhaps reduce that by a factor of 1,000 with various optimizations, meaning less than an hour for 5-instruction sequences and maybe a week for 6-instruction ones. As it happens, most of the best solutions for this problem fall into that 5-6 instruction window...

2 This will be a slow lea, however, so the sequence found by STOKE was still optimal for what I optimized for, which was latency.

You can do it like this, by comparing the result of dividing n / d with dividing (n - 1) / d.

#include <stdio.h>

int main(void) {
    unsigned n;
    unsigned d;
    unsigned q1, q2;
    double actual;

    for(n = 1; n < 6; n++) {
        for(d = 1; d < 6; d++) {
            actual = (double)n / d;
            q1 = n / d;
            if(n) {
                q2 = (n - 1) / d;
                if(q1 == q2) {
                    q1++;
                }
            }
            printf("%u / %u = %u (%f)\n", n, d, q1, actual);
        }
    }

    return 0;
}

Program output:

1 / 1 = 1 (1.000000)
1 / 2 = 1 (0.500000)
1 / 3 = 1 (0.333333)
1 / 4 = 1 (0.250000)
1 / 5 = 1 (0.200000)
2 / 1 = 2 (2.000000)
2 / 2 = 1 (1.000000)
2 / 3 = 1 (0.666667)
2 / 4 = 1 (0.500000)
2 / 5 = 1 (0.400000)
3 / 1 = 3 (3.000000)
3 / 2 = 2 (1.500000)
3 / 3 = 1 (1.000000)
3 / 4 = 1 (0.750000)
3 / 5 = 1 (0.600000)
4 / 1 = 4 (4.000000)
4 / 2 = 2 (2.000000)
4 / 3 = 2 (1.333333)
4 / 4 = 1 (1.000000)
4 / 5 = 1 (0.800000)
5 / 1 = 5 (5.000000)
5 / 2 = 3 (2.500000)
5 / 3 = 2 (1.666667)
5 / 4 = 2 (1.250000)
5 / 5 = 1 (1.000000)

Update

I posted an early answer to the original question, which works but did not consider the efficiency of the algorithm, or that the divisor is always a power of 2. Performing two divisions was needlessly expensive.

I am using the MSVC 32-bit compiler on a 64-bit system, so there is no chance that I can provide the best solution for the required target. But it is an interesting question, so I have dabbled around and found that the best solution will need to discover the bit position of the 2**n divisor. Using the library function log2 worked but was slow. Doing my own shift was much better, but still my best C solution is:

unsigned roundup(unsigned p, unsigned q)
{
    return p / q + ((p & (q-1)) != 0);
}

My inline 32-bit assembler solution is faster, but of course that will not answer the question. I steal some cycles by assuming that eax is returned as the function value.

unsigned roundup(unsigned p, unsigned q)
{
    __asm {
        mov eax,p
        mov edx,q
        bsr ecx,edx     ; cl = bit number of q
        dec edx         ; q-1
        and edx,eax     ; p & (q-1)
        shr eax,cl      ; divide p by q, a power of 2
        sub edx,1       ; generate a carry when (p & (q-1)) == 0
        cmc
        adc eax,0       ; add 1 to result when (p & (q-1)) != 0
    }
}                       ; eax returned as function value

This seems efficient and works for signed values if your compiler uses arithmetic right shifts (usually true).

#include <stdio.h>

int main (void)
{
    for (int i = -20; i <= 20; ++i) {
        printf ("%5d %5d\n", i, ((i - 1) >> 1) + 1);
    }
    return 0;
}

Use >> 2 to divide by 4, >> 3 to divide by 8, etc. An efficient lg does the work there.

You can even divide by 1! >> 0
