Emulating variable bit-shift using only constant shifts?

清歌不尽 2020-12-09 19:19

I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.

The particular PowerPC pro

8 answers

Happy的楠姐 (original poster)

2020-12-09 19:46
If the shift count can be calculated far in advance, I have two ideas that might work:
- Use self-modifying code
  
  Just modify the shift-amount immediate in the instruction. Alternatively, generate code dynamically for the functions that need a variable shift.
- Group the values with the same shift count together if possible, and do the operation all at once using Duff's device or a function pointer to minimize branch misprediction
```
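// Shift-by-constant functions, one per shift count
typedef int (*shiftFunc)(int);    // pointer to a shift function
#define SHL(n) int shl##n(int x) { return x << (n); }
SHL(0)
SHL(1)
SHL(2)
SHL(3)
// ... and so on for the remaining shift counts
shiftFunc shiftLeft[] = { shl0, shl1, shl2, shl3 /* , ... */ };  // shiftLeft[n] shifts by n

#define MAX 1024    // placeholder size (assumption; the original leaves MAX undefined)
int arr[MAX];       // all the values that need to be shifted by the same amount

void shiftAll(void)
{
    shiftFunc shl = shiftLeft[3];   // when you want to shift by 3
    for (int i = 0; i < MAX; i++)
        arr[i] = shl(arr[i]);
}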
```
  This method can also be combined with self-modifying code or run-time code generation to remove the need for a function pointer.
  
  Edit: As pointed out in the comments, there is unfortunately no branch prediction at all on a jump to a register, so the only ways this could work are generating code as described above, or using SIMD.
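A variation on the grouping idea, as a minimal sketch rather than a tuned implementation: pick the loop once per group instead of calling through a pointer for every element, so the dispatch cost is paid once and each loop body contains only a constant shift. The names `SHL_LOOP` and `shl_batch`, and the limit of 8 shift counts, are assumptions made for illustration. The `switch` is still a branch (or an indirect jump), so this only amortizes that cost; it does not remove it.

```c
#include <stddef.h>
#include <stdint.h>

/* One loop per shift count; each loop body uses only a constant shift. */
#define SHL_LOOP(n) \
    case n: for (size_t i = 0; i < len; i++) v[i] <<= n; break;

/* Hypothetical helper: shift every element of v left by count (0..7). */
static void shl_batch(uint32_t *v, size_t len, unsigned count)
{
    switch (count) {    /* single dispatch for the whole batch */
    SHL_LOOP(0) SHL_LOOP(1) SHL_LOOP(2) SHL_LOOP(3)
    SHL_LOOP(4) SHL_LOOP(5) SHL_LOOP(6) SHL_LOOP(7)
    default: break;     /* counts above 7 leave the data untouched in this sketch */
    }
}
```

Calling `shl_batch(arr, len, 3)` shifts the whole group by 3 with one dispatch.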

If the range of the values is small, a lookup table is another possible solution:
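```
#include <stdint.h>

// E is one table entry, truncated to 8 bits; S expands to 8 entries, S2 to 64
#define E(x, n)  (uint8_t)((x) << (n))
#define S(x, n)  E((x) + 0, n), E((x) + 1, n), E((x) + 2, n), E((x) + 3, n), \
                 E((x) + 4, n), E((x) + 5, n), E((x) + 6, n), E((x) + 7, n)
#define S2(x, n) S(((x) + 0)*8, n), S(((x) + 1)*8, n), S(((x) + 2)*8, n), S(((x) + 3)*8, n), \
                 S(((x) + 4)*8, n), S(((x) + 5)*8, n), S(((x) + 6)*8, n), S(((x) + 7)*8, n)
// One row of 256 entries per shift count: shl[n][x] == (uint8_t)(x << n)
uint8_t shl[8][256] = {
    { S2(0U, 0), S2(8U, 0), S2(16U, 0), S2(24U, 0) },
    { S2(0U, 1), S2(8U, 1), S2(16U, 1), S2(24U, 1) },
    { S2(0U, 2), S2(8U, 2), S2(16U, 2), S2(24U, 2) },
    { S2(0U, 3), S2(8U, 3), S2(16U, 3), S2(24U, 3) },
    { S2(0U, 4), S2(8U, 4), S2(16U, 4), S2(24U, 4) },
    { S2(0U, 5), S2(8U, 5), S2(16U, 5), S2(24U, 5) },
    { S2(0U, 6), S2(8U, 6), S2(16U, 6), S2(24U, 6) },
    { S2(0U, 7), S2(8U, 7), S2(16U, 7), S2(24U, 7) },
};
```
Now `x << n` is simply `shl[n][x]`, with `x` being a `uint8_t`. The table costs 2 KB (8 × 256 B) of memory. For 16-bit values, however, you need 16 × 65,536 two-byte entries, which is a 2 MB table. That may still be viable, and you can do a 32-bit shift by combining two 16-bit shifts.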
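To make the "combine two narrower shifts" step concrete, here is a sketch at reduced scale: a 16-bit shift built from two lookups on 8-bit halves. The names `wide_shl`, `init_wide_shl`, and `shl16_via_table` are hypothetical, and storing each entry at the full output width (rather than the input width) is an assumption made so that bits carried out of the low half are not lost. The same pattern with 16-bit halves and 32-bit entries would give the 32-bit shift, at a correspondingly larger table cost.

```c
#include <stdint.h>

/* Hypothetical table: wide_shl[n][v] holds (v << n) truncated to 16 bits. */
static uint16_t wide_shl[16][256];

/* Fill the table by repeated doubling, so no variable shift is needed. */
static void init_wide_shl(void)
{
    for (unsigned v = 0; v < 256; v++)
        wide_shl[0][v] = (uint16_t)v;
    for (unsigned n = 1; n < 16; n++)
        for (unsigned v = 0; v < 256; v++)
            wide_shl[n][v] = (uint16_t)(wide_shl[n - 1][v] + wide_shl[n - 1][v]);
}

/* x << n for n in 0..15, using two lookups and constant shifts only. */
static uint16_t shl16_via_table(uint16_t x, unsigned n)
{
    uint16_t lo = wide_shl[n][x & 0xFF];    /* low byte shifted by n  */
    uint16_t hi = wide_shl[n][x >> 8];      /* high byte shifted by n */
    return (uint16_t)((uint16_t)(hi << 8) | lo);
}
```

Call `init_wide_shl()` once at startup; after that, `shl16_via_table(x, n)` matches `(uint16_t)(x << n)` for shift counts 0 through 15.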