Is there any way to write “mod 31” without modulus/division operators?

后端未结

关注

 5  999

我在风中等你 2020-12-03 16:02

Getting the modulus of a number can be easily done without the modulus operator or divisions, if your operand is a power of 2. In that case, the following formula holds:

5条回答

北海茫月 (楼主)

2020-12-03 16:17
[Edit2] below for performance notes

An attempt with only 1 if condition.

This approach is O(log2(sizeof unsigned)). Run time would increase by 1 set of ands/shifts/add rather than twice the time with a loop approach should code use uint64_t.
```
unsigned mod31(uint32_t x) {
  #define m31 (31lu)
  #define m3131 ((m31 << 5) | m31)
  #define m31313131 ((m3131 << 10) | m3131)

  static const uint32_t mask1 = (m31 << 0) | (m31 << 10) | (m31 << 20) | (m31 << 30);
  static const uint32_t mask2 = (m31 << 5) | (m31 << 15) | (m31 << 25);
  uint32_t a = x & mask1;
  uint32_t b = x & mask2;
  x = a + (b >> 5);
  // x = xx 0000x xxxxx 0000x xxxxx 0000x xxxxx

  a = x & m31313131;
  b = x & (m31313131 << 20);
  x = a + (b >> 20);
  // x = 00 00000 00000 000xx xxxxx 000xx xxxxx

  a = x & m3131;
  b = x & (m3131 << 10);
  x = a + (b >> 10);
  // x = 00 00000 00000 00000 00000 00xxx xxxxx

  a = x & m31;
  b = x & (m31 << 5);
  x = a + (b >> 5);
  // x = 00 00000 00000 00000 00000 0000x xxxxx

  return x >= 31 ? x-31 : x;
}
```
[Edit]

The first addition method sums the individual 7 groups of five bit in parallel. Subsequent additions bring the 7 group into 4, then 2, then 1. This final 7-bit sum then proceeds to add its upper half (2-bits) to its lower half(5-bits). Code then uses one test to perform the final "mod".

This method scales for wider unsigned up to at least uint165_t log2(31+1)*(31+2). Pass that, a little more code is needed.

See @rici for some good optimizations. Still recommend using uint32_t vs. unsigned and 31UL in shifts like 31U << 15 as an unsigned 31U may only be 16 bits long. (16 bit int popular in embedded world in 2014).

[Edit2]

Besides letting the compiler use its optimizer, 2 additional techniques sped performance. These are more minor parlor tricks that yielded a modest improvement. Keep in mind YMMV and this is for a 32-bit unsigned.

Using a table look-up for the last modulo improved 10-20%. Using unsigned t table rather than unsigned char t helped a bit too. It turned out that table length, as first expected needed to be 2*31, only needed 31+5.

Using a local variable rather than always calling the function parameter surprisingly helped. Likely a weakness in my gcc compiler.

Found non-branching solutions, not shown, to replace x >= 31 ? x-31 : x. but their coding complexity was greater and performance was slower.

All-in-all, a fun exercise.
```
unsigned mod31quik(unsigned xx) {
  #define mask (31u | (31u << 10) | (31u << 20) | (31u << 30))
  unsigned x = (xx & mask) + ((xx >> 5) & mask);
  x += x >> 20;
  x += x >> 10;
  x = (x & 31u) + ((x >> 5) & 31u);

  static const unsigned char t[31 * 2 /* 36 */] = { 0, 1, 2, 3, 4, 5, 6,
      7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
      25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
  return t[x];
}
```
0 讨论(0)

查看其它5个回答
发布评论:

提交评论
- 加载中...