AVX2 what is the most efficient way to pack left based on a mask?

后端未结

关注

 5  1222

不知归路 2020-11-22 06:37

If you have an input array, and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in

5条回答

Happy的楠姐 (楼主)

2020-11-22 06:47

In case anyone is interested here is a solution for SSE2 which uses an instruction LUT instead of a data LUT aka a jump table. With AVX this would need 256 cases though.

Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.

static inline __m128 LeftPack_SSE2(__m128 val, int mask)  {
  switch(mask) {
  case  0:
  case  1: return val;
  case  2: return _mm_shuffle_ps(val,val,0x01);
  case  3: return val;
  case  4: return _mm_shuffle_ps(val,val,0x02);
  case  5: return _mm_shuffle_ps(val,val,0x08);
  case  6: return _mm_shuffle_ps(val,val,0x09);
  case  7: return val;
  case  8: return _mm_shuffle_ps(val,val,0x03);
  case  9: return _mm_shuffle_ps(val,val,0x0c);
  case 10: return _mm_shuffle_ps(val,val,0x0d);
  case 11: return _mm_shuffle_ps(val,val,0x34);
  case 12: return _mm_shuffle_ps(val,val,0x0e);
  case 13: return _mm_shuffle_ps(val,val,0x38);
  case 14: return _mm_shuffle_ps(val,val,0x39);
  case 15: return val;
  }
}

__m128 foo(__m128 val, __m128 maskv) {
  int mask = _mm_movemask_ps(maskv);
  return LeftPack_SSE2(val, mask);
}

0 讨论(0)

查看其它5个回答