avx

Fastest way to multiply an array of int64_t?

丶灬走出姿态 submitted on 2019-11-27 08:22:28
I want to vectorize the multiplication of two memory-aligned arrays. I couldn't find any way to do a 64×64-bit multiply in AVX/AVX2, so I just did loop unrolling and AVX2 loads/stores. Is there a faster way to do this? Note: I don't want to save the high-half result of each multiplication.

    void multiply_vex(long *Gi_vec, long q, long *Gj_vec){
        int i;
        __m256i data_j, data_i;
        __uint64_t *ptr_J = (__uint64_t*)&data_j;
        __uint64_t *ptr_I = (__uint64_t*)&data_i;
        for (i=0; i<BASE_VEX_STOP; i+=4) {
            data_i = _mm256_load_si256((__m256i*)&Gi_vec[i]);
            data_j = _mm256_load_si256((__m256i*)&Gj_vec[i]);
            ptr_I[0] -=
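
A workaround commonly suggested for this kind of problem (a sketch, not taken from the excerpt above): since AVX2 has no 64×64-bit vector multiply, synthesize the low 64 bits of each product from 32-bit multiplies. The function name is illustrative; it assumes AVX2 and that only the low half of each product is needed.

    #include <immintrin.h>

    // Low 64 bits of a*b for each of the four 64-bit lanes (AVX2 has no vpmullq).
    static inline __m256i mul64_lo(__m256i a, __m256i b) {
        __m256i bswap  = _mm256_shuffle_epi32(b, 0xB1);    // swap the 32-bit halves of each 64-bit lane
        __m256i cross  = _mm256_mullo_epi32(a, bswap);     // a_lo*b_hi and a_hi*b_lo, 32 bits each
        __m256i crossh = _mm256_srli_epi64(cross, 32);     // bring a_hi*b_lo down to the low half
        __m256i crosss = _mm256_add_epi32(cross, crossh);  // a_lo*b_hi + a_hi*b_lo in the low 32 bits
        __m256i hi     = _mm256_slli_epi64(crosss, 32);    // move the cross term into the high half
        __m256i lo     = _mm256_mul_epu32(a, b);           // full 64-bit a_lo*b_lo
        return _mm256_add_epi64(lo, hi);
    }

In the loop above this would replace the scalar updates through ptr_I, with the result written back via _mm256_store_si256.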

Convention for displaying vector registers

[亡魂溺海] submitted on 2019-11-27 08:19:29
Question: Is there a convention for displaying/writing large registers, like those available in the Intel AVX instruction set? For example, if you have 1 in the least significant byte, 20 in the most significant byte, and 0 elsewhere in an xmm register, is the following byte-wise display preferred (little-endian): [1, 0, 0, 0, ..., 0, 20] or is this preferred: [20, 0, 0, 0, ..., 0, 1]? Similarly, when displaying such registers as made up of larger data items, does the same rule apply? E.g., to
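
For reference (not from the excerpt), Intel's own intrinsics already encode both orderings: _mm_set_epi8 takes its arguments most-significant element first, while _mm_setr_epi8 takes them in memory (little-endian) order. A small sketch of the example value above; the function name is illustrative:

    #include <immintrin.h>

    // Byte 0 (least significant) = 1, byte 15 (most significant) = 20.
    static inline __m128i example_value() {
        __m128i a = _mm_set_epi8(20,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,1);   // highest element listed first
        __m128i b = _mm_setr_epi8(1,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,20);  // memory order, lowest first
        (void)b;  // a and b hold the same value
        return a;
    }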

Fastest way to set __m256 value to all ONE bits

跟風遠走 submitted on 2019-11-27 07:24:11
Question: How can I set all bits to 1 in an __m256 value, using either AVX or AVX2 intrinsics? To get all zeros, you can use _mm256_setzero_si256(). To get all ones, I'm currently using _mm256_set1_epi64x(-1), but I suspect this is slower than the all-zero case. Is there a memory access or scalar/SSE/AVX transition involved here? Also, I can't seem to find a simple bitwise NOT operation in AVX; if that were available, I could simply use setzero followed by a vector NOT. Answer 1: See also
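
For what it's worth (a sketch, not the thread's answer verbatim): _mm256_set1_epi32(-1) and the compare-with-self idiom below normally both compile to a single vpcmpeqd ymm,ymm,ymm with no load and no scalar/SSE/AVX transition, so the all-ones case should not be meaningfully slower than setzero; check your compiler's output to confirm.

    #include <immintrin.h>

    // Recognized by compilers and typically emitted as vpcmpeqd ymm,ymm,ymm (all bits set).
    static inline __m256i all_ones_set1() {
        return _mm256_set1_epi32(-1);
    }

    // Explicit form: any register compared with itself is equal in every lane,
    // so every bit of the result is 1. _mm256_undefined_si256() avoids a false dependency.
    static inline __m256i all_ones_cmpeq() {
        __m256i t = _mm256_undefined_si256();
        return _mm256_cmpeq_epi32(t, t);
    }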

unresolved external symbol __mm256_setr_epi64x

旧城冷巷雨未停 submitted on 2019-11-27 07:00:48
Question: I have written and debugged some AVX code with g++ and now I'm trying to get it to work with MSVC, but I keep getting

    error LNK2019: unresolved external symbol __mm256_setr_epi64x referenced in function "private: union __m256i __thiscall avx_matrix::avx_bit_mask(unsigned int)const " (?avx_bit_mask@avx_matrix@@ABE?AT__m256i@@I@Z)

The referenced piece of code is

    ...
    #include <immintrin.h>
    ...
    /* All zeros except for pos-th position (0..255) */
    __m256i avx_matrix::avx_bit_mask(const std::size_t
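
Two things worth checking here (assumptions, since the full source isn't shown): the unresolved symbol has two leading underscores, while the intrinsic is spelled _mm256_setr_epi64x with one; and 32-bit MSVC historically does not provide the 64-bit-element set intrinsics at all. A sketch of a workaround that builds the value from 32-bit halves; the helper name is illustrative, not part of any library:

    #include <immintrin.h>
    #include <cstdint>

    // Replacement for _mm256_setr_epi64x on toolchains that lack it (e.g. 32-bit MSVC).
    static inline __m256i setr_epi64x_compat(std::int64_t a, std::int64_t b,
                                             std::int64_t c, std::int64_t d) {
        const std::uint64_t ua = (std::uint64_t)a, ub = (std::uint64_t)b;
        const std::uint64_t uc = (std::uint64_t)c, ud = (std::uint64_t)d;
        return _mm256_setr_epi32((int)(std::uint32_t)ua, (int)(ua >> 32),
                                 (int)(std::uint32_t)ub, (int)(ub >> 32),
                                 (int)(std::uint32_t)uc, (int)(uc >> 32),
                                 (int)(std::uint32_t)ud, (int)(ud >> 32));
    }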

Using ymm registers as a “memory-like” storage location

痞子三分冷 submitted on 2019-11-27 06:59:06
Question: Consider the following loop in x86:

    ; on entry, rdi has the number of iterations
    .top:
    ; some magic happens here to calculate a result in rax
    mov [array + rdi * 8], rax  ; store result in output array
    dec rdi
    jnz .top

It's straightforward: something calculates a result in rax (not shown) and then we store the result into an array, in reverse order since we index with rdi. I would like to transform the above loop to not make any writes to memory (we can assume the non-shown calculation doesn't write
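
One direction this can go (a sketch under my own assumptions, not the answer from the thread): if the loop is unrolled so the lane index is a compile-time constant, 64-bit results can be packed into a ymm register with vmovq/vpinsrq/vinserti128 and never touch memory. The function name is illustrative; it requires SSE4.1/AVX2 and a 64-bit build.

    #include <immintrin.h>
    #include <cstdint>

    // Pack four 64-bit results into one ymm without any stores.
    static inline __m256i pack4(std::uint64_t r0, std::uint64_t r1,
                                std::uint64_t r2, std::uint64_t r3) {
        __m128i lo = _mm_cvtsi64_si128((long long)r0);    // vmovq
        lo = _mm_insert_epi64(lo, (long long)r1, 1);      // vpinsrq
        __m128i hi = _mm_cvtsi64_si128((long long)r2);
        hi = _mm_insert_epi64(hi, (long long)r3, 1);
        return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);  // vinserti128
    }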

How to find the horizontal maximum in a 256-bit AVX vector

非 Y 不嫁゛ submitted on 2019-11-27 06:46:26
Question: I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My attempts all ended up using a lot of shuffling of the vector elements, making the code neither elegant nor efficient. Also, I found it impossible to stay only in the AVX domain; at some point I had to use SSE 128-bit instructions to extract the final 64-bit value. However, I would like to be proved
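
A typical reduction pattern for this (a sketch, not the accepted answer verbatim): narrow to 128 bits first, then two more max steps. When compiled with AVX enabled, the 128-bit intrinsics below emit VEX-encoded instructions, so there is no SSE/AVX transition penalty even though the code leaves the 256-bit domain.

    #include <immintrin.h>

    // Horizontal max of the four doubles in v.
    static inline double hmax256_pd(__m256d v) {
        __m128d lo = _mm256_castpd256_pd128(v);    // lower two elements (no instruction)
        __m128d hi = _mm256_extractf128_pd(v, 1);  // upper two elements
        __m128d m2 = _mm_max_pd(lo, hi);           // two candidates left
        __m128d sw = _mm_unpackhi_pd(m2, m2);      // move the upper candidate to the low slot
        __m128d m1 = _mm_max_sd(m2, sw);           // final max in the low element
        return _mm_cvtsd_f64(m1);
    }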

Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

混江龙づ霸主 submitted on 2019-11-27 06:42:25
I am computing eight dot products at once with AVX. In my current code I do something like this (before unrolling):

Sandy Bridge / Ivy Bridge:

    __m256 areg0 = _mm256_set1_ps(a[m]);
    for(int i=0; i<n; i++) {
        __m256 breg0 = _mm256_load_ps(&b[8*i]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0, breg0), tmp0);
    }

Haswell:

    __m256 areg0 = _mm256_set1_ps(a[m]);
    for(int i=0; i<n; i++) {
        __m256 breg0 = _mm256_load_ps(&b[8*i]);
        tmp0 = _mm256_fmadd_ps(areg0, breg0, tmp0);
    }

How many times do I need to unroll the loop for each case to ensure maximum throughput? For Haswell using FMA3 I think the answer is here FLOPS
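
The usual reasoning (a sketch under stated assumptions, not a quote from the answers): you need roughly latency × throughput independent accumulators to keep the units busy. On Haswell, with about 5-cycle FMA latency and two FMAs per cycle, that suggests on the order of 8-10 accumulators; the fragment below shows the structure with 4, assuming n is a multiple of 4 and a hypothetical helper name.

    #include <immintrin.h>

    // Multiple independent accumulators hide FMA latency (Haswell case).
    static inline __m256 dot_block(const float *b, int n, __m256 areg0) {
        __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            acc0 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+0)]), acc0);
            acc1 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+1)]), acc1);
            acc2 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+2)]), acc2);
            acc3 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+3)]), acc3);
        }
        // fold the partial sums once at the end
        return _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    }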

How are the gather instructions in AVX2 implemented?

岁酱吖の submitted on 2019-11-27 05:36:30
Question: Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices. What happens when the data to be loaded lives in different cache lines? Is the instruction implemented as a hardware loop which fetches cache lines one by one, or can it issue loads to multiple cache lines at once? I read a couple of papers which state the former (and that's the one which makes more sense to me), but I would like to know a bit more about this. Link to one paper: http:/
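
For reference, the intrinsic form of VGATHERDPS is _mm256_i32gather_ps; a minimal usage sketch (the function name is illustrative):

    #include <immintrin.h>

    // Load 8 floats from base + idx[k] * 4 for k = 0..7.
    static inline __m256 gather8(const float *base, __m256i idx) {
        return _mm256_i32gather_ps(base, idx, 4);  // scale is 4 bytes per element
    }

As far as published measurements go, Haswell implements the gather as microcode that issues roughly one load per element, so indices spread across many cache lines cost about that many cache accesses; later microarchitectures improve the overhead but still do not fetch a whole vector in one wide access.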

Large (0,1) matrix multiplication using bitwise AND and popcount instead of actual int or float multiplies?

一曲冷凌霜 submitted on 2019-11-27 04:42:33
Question: For multiplying large binary matrices (10Kx20K), what I usually do is convert the matrices to float and perform float matrix multiplication, since integer matrix multiplication is pretty slow (have a look here). This time, though, I'd need to perform over a hundred thousand of these multiplications, and even a millisecond performance improvement on average matters to me. I want an int or float matrix as a result, because the product may have elements that aren't 0 or 1. The input
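
One common building block for this (a sketch under my own assumptions, not the asker's code): pack each row into 64-bit words, AND 256 bits at a time, and accumulate with scalar POPCNT. The names are illustrative, and it assumes AVX2, a POPCNT-capable CPU, 32-byte-aligned inputs, and a word count divisible by 4.

    #include <immintrin.h>
    #include <nmmintrin.h>   // _mm_popcnt_u64
    #include <cstdint>
    #include <cstddef>

    // Dot product of two bit-packed 0/1 rows: sum of popcount(a[i] & b[i]).
    static inline std::uint64_t binary_dot(const std::uint64_t *a,
                                           const std::uint64_t *b,
                                           std::size_t nwords) {
        std::uint64_t sum = 0;
        for (std::size_t i = 0; i < nwords; i += 4) {
            __m256i va = _mm256_load_si256((const __m256i*)&a[i]);
            __m256i vb = _mm256_load_si256((const __m256i*)&b[i]);
            __m256i x  = _mm256_and_si256(va, vb);
            std::uint64_t w[4];
            _mm256_storeu_si256((__m256i*)w, x);   // spill once, popcount the four words
            sum += _mm_popcnt_u64(w[0]) + _mm_popcnt_u64(w[1])
                 + _mm_popcnt_u64(w[2]) + _mm_popcnt_u64(w[3]);
        }
        return sum;
    }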

CPU dispatcher for Visual Studio for AVX and SSE

≯℡__Kan透↙ submitted on 2019-11-27 04:34:05
I work with two computers, one without AVX support and one with AVX. It would be convenient to have my code detect the instruction set supported by my CPU at run time and choose the appropriate code path. I've followed the suggestions by Agner Fog to make a CPU dispatcher ( http://www.agner.org/optimize/#vectorclass ). However, on my machine without AVX, compiling and linking the AVX-enabled code with Visual Studio produces a program that crashes when I run it. For example, I have two source files, one with the SSE2 instruction set defined and some SSE2 instructions, and another one with the AVX
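
A minimal sketch of the runtime test itself (an assumption-laden example, not Agner Fog's dispatcher): the crash described here usually comes from AVX instructions leaking into code that runs before the dispatch, for instance when a whole file is compiled with /arch:AVX, so the AVX path should live in its own translation unit and only be called after a check like this.

    #if defined(_MSC_VER)
    #include <intrin.h>
    static bool cpu_has_avx() {
        int info[4];
        __cpuid(info, 1);
        const bool osxsave = (info[2] & (1 << 27)) != 0;  // OS uses XSAVE/XRSTOR
        const bool avx     = (info[2] & (1 << 28)) != 0;  // CPU reports AVX
        if (!osxsave || !avx) return false;
        const unsigned long long xcr0 = _xgetbv(0);
        return (xcr0 & 0x6) == 0x6;                       // XMM and YMM state enabled by the OS
    }
    #else
    static bool cpu_has_avx() {
        return __builtin_cpu_supports("avx") != 0;        // GCC/Clang helper
    }
    #endif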