Is there any way to write “mod 31” without modulus/division operators?

后端 未结 5 999
我在风中等你
我在风中等你 2020-12-03 16:02

Getting the modulus of a number can be easily done without the modulus operator or divisions, if your operand is a power of 2. In that case, the following formula holds:

5条回答
  •  北海茫月
    2020-12-03 16:17

    [Edit2] below for performance notes

    An attempt with only 1 if condition.

    This approach is O(log2(sizeof unsigned)). Run time would increase by 1 set of ands/shifts/add rather than twice the time with a loop approach should code use uint64_t.

    unsigned mod31(uint32_t x) {
      #define m31 (31lu)
      #define m3131 ((m31 << 5) | m31)
      #define m31313131 ((m3131 << 10) | m3131)
    
      static const uint32_t mask1 = (m31 << 0) | (m31 << 10) | (m31 << 20) | (m31 << 30);
      static const uint32_t mask2 = (m31 << 5) | (m31 << 15) | (m31 << 25);
      uint32_t a = x & mask1;
      uint32_t b = x & mask2;
      x = a + (b >> 5);
      // x = xx 0000x xxxxx 0000x xxxxx 0000x xxxxx
    
      a = x & m31313131;
      b = x & (m31313131 << 20);
      x = a + (b >> 20);
      // x = 00 00000 00000 000xx xxxxx 000xx xxxxx
    
      a = x & m3131;
      b = x & (m3131 << 10);
      x = a + (b >> 10);
      // x = 00 00000 00000 00000 00000 00xxx xxxxx
    
      a = x & m31;
      b = x & (m31 << 5);
      x = a + (b >> 5);
      // x = 00 00000 00000 00000 00000 0000x xxxxx
    
      return x >= 31 ? x-31 : x;
    }
    

    [Edit]

    The first addition method sums the individual 7 groups of five bit in parallel. Subsequent additions bring the 7 group into 4, then 2, then 1. This final 7-bit sum then proceeds to add its upper half (2-bits) to its lower half(5-bits). Code then uses one test to perform the final "mod".

    This method scales for wider unsigned up to at least uint165_t log2(31+1)*(31+2). Pass that, a little more code is needed.

    See @rici for some good optimizations. Still recommend using uint32_t vs. unsigned and 31UL in shifts like 31U << 15 as an unsigned 31U may only be 16 bits long. (16 bit int popular in embedded world in 2014).


    [Edit2]

    Besides letting the compiler use its optimizer, 2 additional techniques sped performance. These are more minor parlor tricks that yielded a modest improvement. Keep in mind YMMV and this is for a 32-bit unsigned.

    Using a table look-up for the last modulo improved 10-20%. Using unsigned t table rather than unsigned char t helped a bit too. It turned out that table length, as first expected needed to be 2*31, only needed 31+5.

    Using a local variable rather than always calling the function parameter surprisingly helped. Likely a weakness in my gcc compiler.

    Found non-branching solutions, not shown, to replace x >= 31 ? x-31 : x. but their coding complexity was greater and performance was slower.

    All-in-all, a fun exercise.

    unsigned mod31quik(unsigned xx) {
      #define mask (31u | (31u << 10) | (31u << 20) | (31u << 30))
      unsigned x = (xx & mask) + ((xx >> 5) & mask);
      x += x >> 20;
      x += x >> 10;
      x = (x & 31u) + ((x >> 5) & 31u);
    
      static const unsigned char t[31 * 2 /* 36 */] = { 0, 1, 2, 3, 4, 5, 6,
          7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
          25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
      return t[x];
    }
    

提交回复
热议问题