Fastest way to multiply an array of int64_t?
I want to vectorize the multiplication of two memory-aligned arrays. I couldn't find any 64x64-bit multiply in AVX/AVX2, so I just unrolled the loop and used AVX2 loads/stores. Is there a faster way to do this?

Note: I don't need the high half of each product.

    void multiply_vex(long *Gi_vec, long q, long *Gj_vec){
        int i;
        __m256i data_j, data_i;
        __uint64_t *ptr_J = (__uint64_t*)&data_j;
        __uint64_t *ptr_I = (__uint64_t*)&data_i;

        for (i=0; i<BASE_VEX_STOP; i+=4) {
            data_i = _mm256_load_si256((__m256i*)&Gi_vec[i]);
            data_j = _mm256_load_si256((__m256i*)&Gj_vec[i]);

            ptr_I[0] -=
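
To make the "low half only" requirement concrete, here is a minimal scalar sketch of how the low 64 bits of a product can be assembled from 32x32->64-bit multiplies. AVX2 does provide that narrower multiply (`_mm256_mul_epu32`), so the same three-multiply pattern is one possible way to vectorize this. The function names `mul64_lo_via_32` and `multiply_low64` are hypothetical and not part of the code above, and whether this would actually beat the unrolled scalar loop is something I have not measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: low 64 bits of a*b using only 32x32->64 multiplies.
 * With a = a_hi*2^32 + a_lo and b = b_hi*2^32 + b_lo:
 *   a*b mod 2^64 = a_lo*b_lo + ((a_hi*b_lo + a_lo*b_hi) << 32)
 * The a_hi*b_hi term is shifted out entirely and never computed. */
uint64_t mul64_lo_via_32(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu;
    uint64_t a_hi = a >> 32;              /* AVX2 analogue: _mm256_srli_epi64(a, 32) */
    uint64_t b_lo = b & 0xFFFFFFFFu;
    uint64_t b_hi = b >> 32;              /* AVX2 analogue: _mm256_srli_epi64(b, 32) */

    uint64_t lo_lo = a_lo * b_lo;         /* AVX2 analogue: _mm256_mul_epu32 */
    /* Sum of the two cross products; it may wrap, which is harmless
     * because only its low 32 bits survive the shift below.
     * AVX2 analogue: two _mm256_mul_epu32 and one _mm256_add_epi64. */
    uint64_t cross = a_hi * b_lo + a_lo * b_hi;

    /* AVX2 analogue: _mm256_slli_epi64(cross, 32), then _mm256_add_epi64. */
    return lo_lo + (cross << 32);
}

/* Hypothetical element-wise driver: dst[i] = low 64 bits of a[i] * b[i]. */
void multiply_low64(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = mul64_lo_via_32(a[i], b[i]);
    }
}
```

In two's complement the low 64 bits of a product are the same for signed and unsigned operands, so the identity also covers `int64_t` data once it is reinterpreted as `uint64_t`.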