AVX2 what is the most efficient way to pack left based on a mask?

不知归路 2020-11-22 06:37

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in

5 Answers
  •  Happy的楠姐
    2020-11-22 06:47

    In case anyone is interested, here is a solution for SSE2 that uses an instruction LUT (a jump table) instead of a data LUT. With AVX, though, this would need 256 cases, one for each possible 8-bit mask.

    Each time you call LeftPack_SSE2 below, it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases (masks 0, 1, 3, 7, and 15) don't need to modify the vector, because the kept elements are already at the front.

    #include <xmmintrin.h>  // _mm_shuffle_ps, _mm_movemask_ps

    // mask is the 4-bit result of _mm_movemask_ps: bit i set means lane i is kept.
    // Kept lanes move to the front in order; the remaining lanes hold leftovers.
    static inline __m128 LeftPack_SSE2(__m128 val, int mask)  {
      switch(mask) {
      case  0:
      case  1: return val;
      case  2: return _mm_shuffle_ps(val,val,0x01);
      case  3: return val;
      case  4: return _mm_shuffle_ps(val,val,0x02);
      case  5: return _mm_shuffle_ps(val,val,0x08);
      case  6: return _mm_shuffle_ps(val,val,0x09);
      case  7: return val;
      case  8: return _mm_shuffle_ps(val,val,0x03);
      case  9: return _mm_shuffle_ps(val,val,0x0c);
      case 10: return _mm_shuffle_ps(val,val,0x0d);
      case 11: return _mm_shuffle_ps(val,val,0x34);
      case 12: return _mm_shuffle_ps(val,val,0x0e);
      case 13: return _mm_shuffle_ps(val,val,0x38);
      case 14: return _mm_shuffle_ps(val,val,0x39);
      case 15: return val;
      }
      return val;  // not reached for a 4-bit mask; avoids falling off the end
    }
    
    __m128 foo(__m128 val, __m128 maskv) {
      int mask = _mm_movemask_ps(maskv);
      return LeftPack_SSE2(val, mask);
    }
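
To tie this back to the question, here is a minimal sketch of how the left-pack could drive an array filter. The helper name filter_greater, the "greater than a limit" condition, and the bit-counting expression are my own assumptions for illustration, not part of the answer above. LeftPack_SSE2 is repeated in compact form (the five no-op masks go through default) only so the block compiles on its own.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: loads/stores, _mm_cmpgt_ps, _mm_movemask_ps, _mm_shuffle_ps */

/* Same shuffles as LeftPack_SSE2 above; masks 0, 1, 3, 7, 15 need no shuffle. */
static inline __m128 LeftPack_SSE2(__m128 val, int mask) {
  switch (mask) {
  case  2: return _mm_shuffle_ps(val, val, 0x01);
  case  4: return _mm_shuffle_ps(val, val, 0x02);
  case  5: return _mm_shuffle_ps(val, val, 0x08);
  case  6: return _mm_shuffle_ps(val, val, 0x09);
  case  8: return _mm_shuffle_ps(val, val, 0x03);
  case  9: return _mm_shuffle_ps(val, val, 0x0c);
  case 10: return _mm_shuffle_ps(val, val, 0x0d);
  case 11: return _mm_shuffle_ps(val, val, 0x34);
  case 12: return _mm_shuffle_ps(val, val, 0x0e);
  case 13: return _mm_shuffle_ps(val, val, 0x38);
  case 14: return _mm_shuffle_ps(val, val, 0x39);
  default: return val;
  }
}

/* Hypothetical helper: copies every src[i] > limit to dst, keeping order,
   and returns how many elements were written. dst needs room for n floats
   and must not overlap src. */
static size_t filter_greater(const float *src, size_t n, float limit, float *dst) {
  const __m128 vlimit = _mm_set1_ps(limit);
  size_t out = 0, i = 0;
  for (; i + 4 <= n; i += 4) {
    __m128 v = _mm_loadu_ps(src + i);
    int mask = _mm_movemask_ps(_mm_cmpgt_ps(v, vlimit));
    /* Always store all 4 lanes; only the first popcount(mask) are valid,
       and the next iteration overwrites the leftovers. out <= i, so the
       store never runs past dst[n - 1]. */
    _mm_storeu_ps(dst + out, LeftPack_SSE2(v, mask));
    out += (size_t)((mask & 1) + ((mask >> 1) & 1) +
                    ((mask >> 2) & 1) + ((mask >> 3) & 1));
  }
  for (; i < n; ++i)  /* scalar tail for the last n % 4 elements */
    if (src[i] > limit) dst[out++] = src[i];
  return out;
}
```

The full 4-lane store followed by advancing the output index by the number of set mask bits is what makes a left-pack useful here: the garbage lanes are simply overwritten by the next store.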
    
