Emulating variable bit-shift using only constant shifts?

清歌不尽 2020-12-09 19:19

I'm trying to find a way to perform an indirect shift-left/right operation without actually using the variable shift op or any branches.

The particular PowerPC pro

8 answers
  •  Happy的楠姐
    2020-12-09 19:46

    If the shift count can be calculated far in advance, then I have two ideas that might work:

    • Using self-modifying code

      Just patch the immediate shift amount in the instruction itself. Alternatively, generate the code for the functions that need a variable shift dynamically at run time.

    • Group the values that share a shift count together if possible, and do the operation on all of them at once, using Duff's device or a function pointer to minimize branch mispredictions

      // shift-by-constant functions
      typedef int (*shiftFunc)(int);    // the shift function
      #define SHL(n) int shl##n(int x) { return x << (n); }
      SHL(0)    // included so that shiftLeft[n] shifts by exactly n
      SHL(1)
      SHL(2)
      SHL(3)
      ...
      shiftFunc shiftLeft[] = { shl0, shl1, shl2, shl3... };

      int arr[MAX];       // all the values that need to be shifted by the same amount
      shiftFunc shl = shiftLeft[3]; // when you want to shift by 3
      for (int i = 0; i < MAX; i++)
          arr[i] = shl(arr[i]);

      This method can also be combined with self-modifying code or run-time code generation to remove the need for a function pointer.

      Edit: As pointed out in the comments, unfortunately there is no branch prediction on jumps to a register at all, so the only ways this could work are generating code as described above, or using SIMD.
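
    The grouping idea can also be sketched without a function pointer, by selecting the shift count once per batch and running a loop whose shift amount is an immediate. This is only a minimal illustration: the name shl_batch and the limit of four shift counts are assumptions made for brevity, and the switch is still a branch, although it is taken once per batch rather than once per element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: shift every element of arr left by n, for n in 0..3.
   The switch is evaluated once per batch; each loop uses a constant shift. */
static void shl_batch(uint32_t *arr, size_t len, unsigned n)
{
    size_t i;
    assert(n <= 3);  /* only four counts are spelled out in this sketch */
    switch (n) {
    case 1: for (i = 0; i < len; i++) arr[i] <<= 1; break;
    case 2: for (i = 0; i < len; i++) arr[i] <<= 2; break;
    case 3: for (i = 0; i < len; i++) arr[i] <<= 3; break;
    default: break;  /* n == 0: nothing to do */
    }
}
```

    Whether this helps depends on how large the batches are and on the compiler keeping the loops separate, so it is worth checking the generated code.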


    If the range of the values is small, a lookup table is another possible solution:

    #define S(x, n) ((x) + 0) << (n), ((x) + 1) << (n), ((x) + 2) << (n), ((x) + 3) << (n), \
                    ((x) + 4) << (n), ((x) + 5) << (n), ((x) + 6) << (n), ((x) + 7) << (n)
    #define S2(x, n)    S(((x) + 0)*8, n), S(((x) + 1)*8, n), S(((x) + 2)*8, n), S(((x) + 3)*8, n), \
                        S(((x) + 4)*8, n), S(((x) + 5)*8, n), S(((x) + 6)*8, n), S(((x) + 7)*8, n)
    // one row per shift count n, one column per input value x;
    // each entry is truncated to 8 bits when stored
    uint8_t shl[8][256] = {
        { S2(0U, 0), S2(8U, 0), S2(16U, 0), S2(24U, 0) },
        { S2(0U, 1), S2(8U, 1), S2(16U, 1), S2(24U, 1) },
        ...
        { S2(0U, 7), S2(8U, 7), S2(16U, 7), S2(24U, 7) },
    };


    Now x << n is simply shl[n][x], with x being a uint8_t. The table costs 2 KB (8 × 256 B) of memory. For 16-bit values, however, you'll need a table of 16 × 64K entries (2 MB if each entry is 16 bits wide), which may still be viable, and you can do a 32-bit shift by combining two 16-bit shifts.
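
    Combining narrower table lookups into a wider shift can be sketched one size down, building a 16-bit left shift from two byte lookups. This is an illustration under stated assumptions, not the table above: the names shl_tab, init_shl_tab and shl16 are hypothetical, the entries are widened to 16 bits so that bits carried out of a byte are kept, and the shift count is limited to 0..7. The table is filled by repeated doubling, so no variable shift is needed even during setup.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table: shl_tab[n][b] holds (b << n) for a byte b and n in 0..7,
   stored in 16 bits so that bits shifted out of the byte are not lost. */
static uint16_t shl_tab[8][256];

/* Fill each row by doubling the previous one (x << n == 2 * (x << (n - 1))). */
static void init_shl_tab(void)
{
    for (int b = 0; b < 256; b++)
        shl_tab[0][b] = (uint16_t)b;
    for (int n = 1; n < 8; n++)
        for (int b = 0; b < 256; b++)
            shl_tab[n][b] = (uint16_t)(shl_tab[n - 1][b] + shl_tab[n - 1][b]);
}

/* 16-bit left shift by n (0..7): two lookups, constant shifts, and an OR. */
static uint16_t shl16(uint16_t x, unsigned n)
{
    assert(n < 8);  /* debug-only check, compiled out with NDEBUG */
    uint16_t lo = shl_tab[n][x & 0xFF];  /* low byte shifted, carries kept */
    uint16_t hi = shl_tab[n][x >> 8];    /* high byte shifted */
    return (uint16_t)(lo | (hi << 8));   /* the two parts never overlap */
}
```

    After calling init_shl_tab() once, shl16(0x1234, 4) yields 0x2340. This table takes 4 KB (8 × 256 entries of 2 bytes each).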
