Determine the minimum across SIMD lanes of __m256 value

Anonymous (unverified), submitted 2019-12-03 08:54:24

Question:

I understand that operations across SIMD lanes should generally be avoided. However, sometimes it has to be done.

I am using AVX2 intrinsics, and have 8 floating point values in an __m256.

I want to know the lowest value in this vector, and to complicate matters: also in which slot this was.

My current solution makes a round trip to memory, which I don't like:

    float closestvals[8];
    _mm256_store_ps( closestvals, closest8 );

    float closest  = closestvals[0];
    int closestidx = 0;
    for ( int k=1; k<8; ++k )
    {
        if ( closestvals[k] < closest )
        {
            closest = closestvals[ k ];
            closestidx = k;
        }
    }

What would be a good way to do this without going to/from memory?

Answer 1:

You can try this:

    #include <stdio.h>
    #include <x86intrin.h>
    #include <math.h>
    /*  gcc -O3 -Wall -m64 -march=haswell hor_min.c   */
    int print_vec_ps(__m256 x);

    int main() {
        float x[8]={1.2f, 3.6f, 2.1f, 9.4f,   4.0f, 0.1f, 8.9f, 3.3f};
        /* Note that the results are not useful if one of the inputs is a 'not a number'.
           The input below leads to indx = 32 (!) */
    //    float x[8]={1.2f, 3.6f, 2.1f, NAN,  4.0f, 2.0f , 8.9f, 3.3f};

        /* Unaligned load: a plain float array is not guaranteed to be 32-byte aligned */
        __m256 v0    = _mm256_loadu_ps(x);
        /* _mm256_shuffle_ps instead of _mm256_permute_ps is also possible, see Peter Cordes' comments */
        __m256 v1    = _mm256_permute_ps(v0,0b10110001); /* swap floats: 0<->1, 2<->3, 4<->5, 6<->7 */
        __m256 v2    = _mm256_min_ps(v0,v1);
        __m256 v3    = _mm256_permute_ps(v2,0b01001110); /* swap float pairs within each 128-bit lane */
        __m256 v4    = _mm256_min_ps(v2,v3);
        __m256 v5    = _mm256_castpd_ps(_mm256_permute4x64_pd(_mm256_castps_pd(v4),0b01001110)); /* swap 128-bit lanes */
        __m256 v_min = _mm256_min_ps(v4,v5);             /* minimum broadcast to all 8 elements */
        __m256 mask  = _mm256_cmp_ps(v0,v_min,0);        /* predicate 0 is _CMP_EQ_OQ */
        int    indx  = _tzcnt_u32(_mm256_movemask_ps(mask)); /* lowest index that holds the minimum */

        printf("             7      6      5      4      3      2      1      0\n");
        printf("v0     = ");print_vec_ps(v0    );
        printf("v1     = ");print_vec_ps(v1    );
        printf("v2     = ");print_vec_ps(v2    );
        printf("\nv3     = ");print_vec_ps(v3    );
        printf("v4     = ");print_vec_ps(v4    );
        printf("\nv5     = ");print_vec_ps(v5    );
        printf("v_min  = ");print_vec_ps(v_min );
        printf("mask   = ");print_vec_ps(mask  );
        printf("indx   = ");printf("%d\n",indx);
        return 0;
    }

    int print_vec_ps(__m256 x){
        float v[8];
        _mm256_storeu_ps(v,x);
        printf("%5.2f  %5.2f  %5.2f  %5.2f  %5.2f  %5.2f  %5.2f  %5.2f\n",
               v[7],v[6],v[5],v[4],v[3],v[2],v[1],v[0]);
        return 0;
    }

Output:

    ./a.out
                 7      6      5      4      3      2      1      0
    v0     =  3.30   8.90   0.10   4.00   9.40   2.10   3.60   1.20
    v1     =  8.90   3.30   4.00   0.10   2.10   9.40   1.20   3.60
    v2     =  3.30   3.30   0.10   0.10   2.10   2.10   1.20   1.20

    v3     =  0.10   0.10   3.30   3.30   1.20   1.20   2.10   2.10
    v4     =  0.10   0.10   0.10   0.10   1.20   1.20   1.20   1.20

    v5     =  1.20   1.20   1.20   1.20   0.10   0.10   0.10   0.10
    v_min  =  0.10   0.10   0.10   0.10   0.10   0.10   0.10   0.10
    mask   =  0.00   0.00   -nan   0.00   0.00   0.00   0.00   0.00
    indx   = 5

In the previous version of this answer, the 128-bit lanes were swapped with _mm256_permute2f128_ps. In this updated answer _mm256_permute2f128_ps is replaced by _mm256_permute4x64_pd, which is faster on AMD CPUs and on Intel KNL, see @Peter Cordes' comments. But note that _mm256_permute4x64_pd requires AVX2, while AVX is sufficient for _mm256_permute2f128_ps.
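For targets that only have AVX, the lane swap can fall back to _mm256_permute2f128_ps as described above. The following is a minimal sketch of that variant, not the answer's exact earlier code: the function name hmin_index_avx, the pointer-based interface, and the -1 return value are assumptions made for illustration, and the target attribute and __builtin_ctz assume GCC or Clang.

```c
#include <immintrin.h>

/* Hypothetical helper (name and interface are illustrative, not from the answer).
   Returns the lowest index holding the minimum of x[0..7] and writes the minimum
   to *min_out. Returns -1 if no element compares equal to the reduced minimum,
   which can happen when the input contains NaN. Requires AVX only. */
__attribute__((target("avx")))
static int hmin_index_avx(const float x[8], float *min_out)
{
    __m256 v0    = _mm256_loadu_ps(x);
    __m256 v1    = _mm256_permute_ps(v0, 0xB1);          /* swap 0<->1, 2<->3, 4<->5, 6<->7 */
    __m256 v2    = _mm256_min_ps(v0, v1);
    __m256 v3    = _mm256_permute_ps(v2, 0x4E);          /* swap pairs within each 128-bit lane */
    __m256 v4    = _mm256_min_ps(v2, v3);
    __m256 v5    = _mm256_permute2f128_ps(v4, v4, 0x01); /* swap the two 128-bit lanes (AVX) */
    __m256 v_min = _mm256_min_ps(v4, v5);                /* minimum in all 8 elements */

    int bits = _mm256_movemask_ps(_mm256_cmp_ps(v0, v_min, _CMP_EQ_OQ));
    *min_out = _mm_cvtss_f32(_mm256_castps256_ps128(v_min));

    /* __builtin_ctz is undefined for 0, so handle the empty mask explicitly */
    return bits ? __builtin_ctz((unsigned)bits) : -1;
}
```

Passing a pointer rather than an __m256 keeps callers free of AVX-specific calling conventions; in a hot loop one would more likely inline the body and keep the vector in a register.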

Also note that the results of this code are useless if one of the input values is a 'not a number' (NAN).
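The NaN problem arises because _mm256_min_ps returns its second operand whenever either operand is NaN, so a NaN can survive in some elements of the reduction and not in others, and the final equality compare may then match nothing. If NaN inputs are possible, one option is to replace them with +infinity before reducing and to ignore those elements when picking the index. The sketch below is an assumption-laden illustration, not part of the original answer: the name hmin_index_ignore_nan, the pointer interface, and the -1 result for an all-NaN input are all hypothetical choices, and the target attribute and __builtin_ctz assume GCC or Clang.

```c
#include <immintrin.h>
#include <math.h>

/* Hypothetical helper (illustrative only). Finds the minimum of the non-NaN
   elements of x[0..7], writes it to *min_out, and returns its lowest index.
   Returns -1 and leaves *min_out untouched if every element is NaN. */
__attribute__((target("avx")))
static int hmin_index_ignore_nan(const float x[8], float *min_out)
{
    __m256 v   = _mm256_loadu_ps(x);
    __m256 ord = _mm256_cmp_ps(v, v, _CMP_ORD_Q);  /* all-ones where the element is not NaN */
    int ord_bits = _mm256_movemask_ps(ord);
    if (ord_bits == 0)
        return -1;                                 /* nothing but NaN */

    /* Keep ordered elements, replace NaN elements with +infinity */
    __m256 v0 = _mm256_blendv_ps(_mm256_set1_ps(INFINITY), v, ord);

    __m256 v1    = _mm256_permute_ps(v0, 0xB1);          /* swap 0<->1, 2<->3, 4<->5, 6<->7 */
    __m256 v2    = _mm256_min_ps(v0, v1);
    __m256 v3    = _mm256_permute_ps(v2, 0x4E);          /* swap pairs within each 128-bit lane */
    __m256 v4    = _mm256_min_ps(v2, v3);
    __m256 v5    = _mm256_permute2f128_ps(v4, v4, 0x01); /* swap the two 128-bit lanes */
    __m256 v_min = _mm256_min_ps(v4, v5);

    /* Only elements that were not NaN may be reported as the position of the minimum */
    int bits = _mm256_movemask_ps(_mm256_cmp_ps(v0, v_min, _CMP_EQ_OQ)) & ord_bits;
    *min_out = _mm_cvtss_f32(_mm256_castps256_ps128(v_min));
    return __builtin_ctz((unsigned)bits);
}
```

The extra compare and blend add a little latency in front of the reduction, so this is only worth it when NaN inputs cannot be ruled out earlier in the pipeline.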


