Sparse array compression using SIMD (AVX2)

粉色の甜心 2020-12-17 02:19

I have a sparse array a (mostly zeroes):

unsigned char a[1000000];

and I would like to create an array b of inde

3 answers

天涯浪人 (original poster)

2020-12-17 03:16

Five methods to compute the indices of the nonzeros are:

1. Semi-vectorized loop: Load a SIMD vector of chars, compare it with zero, and apply a movemask. If any of the chars is nonzero, a small scalar loop extracts the indices (also suggested by @stgatilov). This works well for very sparse arrays. The function arr2ind_movmsk in the code below uses BMI1 instructions for the scalar loop.
2. Vectorized loop: Intel Haswell processors and newer support the BMI1 and BMI2 instruction sets. BMI2 contains the pext instruction (parallel bits extract, see Wikipedia), which turns out to be useful here. See arr2ind_pext in the code below.
3. Classic scalar loop with an if statement: arr2ind_if.
4. Scalar loop without branches: arr2ind_cmov.
5. Lookup table: @stgatilov shows that a lookup table can be used instead of the pdep and other integer instructions. This might work well, but the lookup table is quite large: it doesn't fit in the L1 cache. Not tested here. See also the discussion here.
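
The first method above can be sketched as follows. This is a minimal illustration, not the arr2ind_movmsk function from this answer: it assumes 16-byte SSE2 vectors instead of 32-byte AVX2 so that it builds on any x86-64 compiler without extra flags, it falls back to a scalar mask when SSE2 is unavailable, and it uses plain C where the answer uses BMI1 instructions. The names nonzero_mask16 and nonzero_indices_movemask are hypothetical.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns a 16-bit mask in which bit j is set when a[j] != 0 (j = 0..15). */
static uint32_t nonzero_mask16(const unsigned char *a)
{
#if defined(__SSE2__)
    __m128i v  = _mm_loadu_si128((const __m128i *)a);      /* load 16 bytes          */
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());   /* 0xFF where a[j] == 0   */
    /* movemask collects the top bit of each byte; invert to mark the nonzeros. */
    return (~(uint32_t)_mm_movemask_epi8(eq)) & 0xFFFFu;
#else
    /* Scalar fallback (assumption: only used on targets without SSE2). */
    uint32_t k = 0;
    for (int j = 0; j < 16; j++)
        k |= (uint32_t)(a[j] != 0) << j;
    return k;
#endif
}

/* Hypothetical helper: n must be a multiple of 16 and ind must hold n ints.
   Returns the number of indices written. */
static int nonzero_indices_movemask(const unsigned char *a, int n, int *ind)
{
    int m = 0;
    for (int i = 0; i < n; i += 16) {
        uint32_t k = nonzero_mask16(a + i);
        while (k) {                       /* skipped entirely for all-zero blocks */
            int j = 0;
            while (!((k >> j) & 1u))      /* position of the lowest set bit       */
                j++;
            ind[m++] = i + j;
            k &= k - 1;                   /* clear the lowest set bit             */
        }
    }
    return m;
}
```

The inner while loop runs once per nonzero byte, which is why this approach is cheap for very sparse input and gets slower as the density grows.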

/* 
gcc -O3 -Wall -m64 -mavx2 -fopenmp -march=broadwell -std=c99 -falign-loops=16 sprs_char2ind.c

example: Test different methods with an array a of size 20000 and approximately 25/1024*100%=2.4% nonzeros:   
              ./a.out 20000 25
*/

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <immintrin.h>
#include <omp.h>


__attribute__ ((noinline)) int arr2ind_movmsk(const unsigned char * restrict a, int n, int * restrict ind, int * m){
   int i, m0, k;
   __m256i msk;
   m0=0;
   for (i=0;i [...] >2;                             /* p is the number of nonzeros in 16 bytes of a.                                                */
      uint64_t  cntr       = _pext_u64(cntr_const,msk64);                          /* parallel bits extract. cntr contains p 4-bit integers. The 16 4-bit integers in cntr_const are shuffled to the p 4-bit integers that we want */
                                                                                   /* The next 7 intrinsics unpack these p 4-bit integers to p 32-bit integers.                    */  
      __m256i   cntr256    = _mm256_set1_epi64x(cntr);
                cntr256    = _mm256_srlv_epi64(cntr256,shft);
                cntr256    = _mm256_and_si256(cntr256,vmsk);
      __m256i   cntr256_lo = _mm256_shuffle_epi8(cntr256,shf_lo);
      __m256i   cntr256_hi = _mm256_shuffle_epi8(cntr256,shf_hi);
                cntr256_lo = _mm256_add_epi32(i_vec,cntr256_lo);
                cntr256_hi = _mm256_add_epi32(i_vec,cntr256_hi);

                             _mm256_storeu_si256((__m256i *)&ind[m0],cntr256_lo);     /* Note that the stores of iteration i and i+16 may overlap.                                                         */
                             _mm256_storeu_si256((__m256i *)&ind[m0+8],cntr256_hi);   /* Array ind has to be large enough to avoid segfaults. At most 16 integers are written more than strictly necessary */ 
                m0         = m0+p;
                i_vec      = _mm256_add_epi32(i_vec,cnst16);
   }
   *m=m0;
   return 0;
}


__attribute__ ((noinline)) int arr2ind_if(const unsigned char * restrict a, int n, int * restrict ind, int * m){
   int i, m0;
   m0=0;
   for (i=0;i [...] >31))^(ind[i]);
   }
   printf("chk = %10X\n",chk);
   return 0;
}



int main(int argc, char **argv){
int n, i, m; 
unsigned int j, k, d;
unsigned char *a;
int *ind;
double t0,t1;
int meth, nrep;
char txt[30];

sscanf(argv[1],"%d",&n);            /* Length of array a.                                    */
n=n>>5;                             /* Adjust n to a multiple of 32.                         */
n=n<<5;
sscanf(argv[2],"%u",&d);            /* The approximate fraction of nonzeros in a is: d/1024  */
printf("n=%d,   d=%u\n",n,d);

a=_mm_malloc(n*sizeof(char),32);
ind=_mm_malloc(n*sizeof(int),32);    

                                    /* Generate a pseudo random array a.                     */
j=73659343;                   
for (i=0;i [...] >8;              /* k is a pseudo random number between 0 and 1023        */
   if (k [...]
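
The nibble-packing step that the comments in arr2ind_pext describe can be sketched in isolation. This is a simplified model under stated assumptions, not the answer's vectorized code: the value 0xFEDCBA9876543210 for cntr_const is an assumption (nibble j holds the value j), the mask is built and the result is unpacked with scalar loops instead of AVX2 shuffles, and pext64 and nonzero_positions16 are hypothetical names. The hardware instruction is used only when the compiler targets BMI2; otherwise a bit-by-bit software model stands in for it.

```c
#include <stdint.h>
#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

/* pext: gather the bits of src selected by mask into the low bits of the result. */
static uint64_t pext64(uint64_t src, uint64_t mask)
{
#if defined(__BMI2__) && defined(__x86_64__)
    return _pext_u64(src, mask);
#else
    /* Software model of the instruction, for targets without BMI2. */
    uint64_t out = 0;
    int k = 0;
    for (int b = 0; b < 64; b++) {
        if ((mask >> b) & 1u) {
            out |= ((src >> b) & 1u) << k;
            k++;
        }
    }
    return out;
#endif
}

/* Hypothetical helper: writes the positions (0..15) of the nonzero bytes of
   a[0..15] to out and returns their count p. */
static int nonzero_positions16(const unsigned char *a, int *out)
{
    const uint64_t cntr_const = 0xFEDCBA9876543210ULL;  /* assumed: nibble j == j */
    uint64_t nib_mask = 0;
    int p = 0;
    for (int j = 0; j < 16; j++) {
        if (a[j] != 0) {
            nib_mask |= (uint64_t)0xF << (4 * j);       /* select nibble j        */
            p++;
        }
    }
    /* The selected nibbles are packed contiguously: cntr holds p 4-bit indices. */
    uint64_t cntr = pext64(cntr_const, nib_mask);
    for (int q = 0; q < p; q++)
        out[q] = (int)((cntr >> (4 * q)) & 0xF);
    return p;
}
```

Because the amount of work per 16-byte block does not depend on how many bytes are nonzero, a loop built on this step has a running time that is nearly independent of the density.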






The code was tested with array sizes of n=10016 (the data fits in the L1 cache) and n=1000000, at 
nonzero densities of about 0.5%, 5% and 50%. For accurate timing, the functions were called 1000000 
and 10000 times, respectively.




Time in seconds, size n=10016, 1e6 function calls. Intel Core i5-6500
                     0.53%        5.1%       50.0%
arr2ind_movmsk:       0.27        0.53        4.89
arr2ind_pext:         1.44        1.59        1.45
arr2ind_if:           5.93        8.95       33.82
arr2ind_cmov:         6.82        6.83        6.82

Time in seconds, size n=1000000, 1e4 function calls.
                     0.49%        5.1%       50.1%
arr2ind_movmsk:       0.57        2.03        5.37
arr2ind_pext:         1.47        1.47        1.46
arr2ind_if:           5.88        8.98       38.59
arr2ind_cmov:         6.82        6.81        6.81





In these examples the vectorized loops are faster than the scalar loops. 
The performance of arr2ind_movmsk depends strongly on the density of a. It is only
faster than arr2ind_pext if the density is sufficiently small: at about 0.5% for both array sizes, and at about 5% only for n=10016. The break-even point therefore also depends on the array size n.
The function arr2ind_if clearly suffers from branch mispredictions at 50% nonzero density.
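
The difference between the two scalar variants can be sketched as follows. These are minimal illustrations of the technique, not the arr2ind_if and arr2ind_cmov functions from this answer, and the function names are hypothetical. Whether the compiler really emits branch-free code for the second loop depends on the compiler and its flags, so that is an assumption to check in the generated assembly.

```c
/* Branching version: the store happens only for nonzero elements, so the
   branch is hard to predict when about half of the elements are nonzero. */
static int nonzero_indices_branchy(const unsigned char *a, int n, int *ind)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] != 0)
            ind[m++] = i;
    }
    return m;
}

/* Branch-free version: always store the candidate index, then advance the
   output position only if the element is nonzero. A rejected candidate is
   overwritten by the next store. Since m <= i at every store, ind needs room
   for n ints. */
static int nonzero_indices_branchless(const unsigned char *a, int n, int *ind)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        ind[m] = i;
        m += (a[i] != 0);
    }
    return m;
}
```

The branch-free loop does the same work for every element, which matches the nearly constant arr2ind_cmov timings in the tables, while the branching loop is cheaper on predictable input and much slower at 50% density.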