avx

Fastest way to multiply an array of int64_t?

丶灬走出姿态 submitted on 2019-11-27 08:22:28
I want to vectorize the multiplication of two memory-aligned arrays. I couldn't find any way to do a 64×64-bit multiply in AVX/AVX2, so I just did loop unrolling and AVX2 loads/stores. Is there a faster way to do this? Note: I don't want to save the high-half result of each multiplication.

    void multiply_vex(long *Gi_vec, long q, long *Gj_vec){
        int i;
        __m256i data_j, data_i;
        __uint64_t *ptr_J = (__uint64_t*)&data_j;
        __uint64_t *ptr_I = (__uint64_t*)&data_i;
        for (i=0; i<BASE_VEX_STOP; i+=4) {
            data_i = _mm256_load_si256((__m256i*)&Gi_vec[i]);
            data_j = _mm256_load_si256((__m256i*)&Gj_vec[i]);
            ptr_I[0] -=
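
A workaround commonly suggested for this kind of problem (a sketch, not taken from the excerpt above): since AVX2 has no 64×64-bit vector multiply, synthesize the low 64 bits of each product from 32-bit multiplies. The function name is illustrative; it assumes AVX2 and that only the low half of each product is needed.

    #include <immintrin.h>

    // Low 64 bits of a*b for each of the four 64-bit lanes (AVX2 has no vpmullq).
    static inline __m256i mul64_lo(__m256i a, __m256i b) {
        __m256i bswap  = _mm256_shuffle_epi32(b, 0xB1);    // swap the 32-bit halves of each 64-bit lane
        __m256i cross  = _mm256_mullo_epi32(a, bswap);     // a_lo*b_hi and a_hi*b_lo, 32 bits each
        __m256i crossh = _mm256_srli_epi64(cross, 32);     // bring a_hi*b_lo down to the low half
        __m256i crosss = _mm256_add_epi32(cross, crossh);  // a_lo*b_hi + a_hi*b_lo in the low 32 bits
        __m256i hi     = _mm256_slli_epi64(crosss, 32);    // move the cross term into the high half
        __m256i lo     = _mm256_mul_epu32(a, b);           // full 64-bit a_lo*b_lo
        return _mm256_add_epi64(lo, hi);
    }

In the loop above this would replace the scalar updates through ptr_I, with the result written back via _mm256_store_si256.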

Convention for displaying vector registers

[亡魂溺海] submitted on 2019-11-27 08:19:29
Question: Is there a convention for displaying/writing large registers, like those available in the Intel AVX instruction set? For example, if you have 1 in the least significant byte, 20 in the most significant byte, and 0 elsewhere in an xmm register, is the following byte-wise display preferred (little-endian): [1, 0, 0, 0, ..., 0, 20] or is this preferred: [20, 0, 0, 0, ..., 0, 1]? Similarly, when displaying such registers as made up of larger data items, does the same rule apply? E.g., to
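
For reference (not from the excerpt), Intel's own intrinsics already encode both orderings: _mm_set_epi8 takes its arguments most-significant element first, while _mm_setr_epi8 takes them in memory (little-endian) order. A small sketch of the example value above; the function name is illustrative:

    #include <immintrin.h>

    // Byte 0 (least significant) = 1, byte 15 (most significant) = 20.
    static inline __m128i example_value() {
        __m128i a = _mm_set_epi8(20,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,1);   // highest element listed first
        __m128i b = _mm_setr_epi8(1,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,20);  // memory order, lowest first
        (void)b;  // a and b hold the same value
        return a;
    }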

Fastest way to set __m256 value to all ONE bits

跟風遠走 submitted on 2019-11-27 07:24:11
Question: How can I set all bits to 1 in an __m256 value, using either AVX or AVX2 intrinsics? To get all zeros, you can use _mm256_setzero_si256(). To get all ones, I'm currently using _mm256_set1_epi64x(-1), but I suspect this is slower than the all-zero case. Is there a memory access or scalar/SSE/AVX transition involved here? Also, I can't seem to find a simple bitwise NOT operation in AVX; if that were available, I could simply use setzero followed by a vector NOT. Answer 1: See also
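
For what it's worth (a sketch, not the thread's answer verbatim): _mm256_set1_epi32(-1) and the compare-with-self idiom below normally both compile to a single vpcmpeqd ymm,ymm,ymm with no load and no scalar/SSE/AVX transition, so the all-ones case should not be meaningfully slower than setzero; check your compiler's output to confirm.

    #include <immintrin.h>

    // Recognized by compilers and typically emitted as vpcmpeqd ymm,ymm,ymm (all bits set).
    static inline __m256i all_ones_set1() {
        return _mm256_set1_epi32(-1);
    }

    // Explicit form: any register compared with itself is equal in every lane,
    // so every bit of the result is 1. _mm256_undefined_si256() avoids a false dependency.
    static inline __m256i all_ones_cmpeq() {
        __m256i t = _mm256_undefined_si256();
        return _mm256_cmpeq_epi32(t, t);
    }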

unresolved external symbol __mm256_setr_epi64x

旧城冷巷雨未停 submitted on 2019-11-27 07:00:48
Question: I have written and debugged some AVX code with g++ and now I'm trying to get it to work with MSVC, but I keep getting

    error LNK2019: unresolved external symbol __mm256_setr_epi64x referenced in function "private: union __m256i __thiscall avx_matrix::avx_bit_mask(unsigned int)const " (?avx_bit_mask@avx_matrix@@ABE?AT__m256i@@I@Z)

The referenced piece of code is

    ...
    #include <immintrin.h>
    ...
    /* All zeros except for pos-th position (0..255) */
    __m256i avx_matrix::avx_bit_mask(const std::size_t
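
Two things worth checking here (assumptions, since the full source isn't shown): the unresolved symbol has two leading underscores, while the intrinsic is spelled _mm256_setr_epi64x with one; and 32-bit MSVC historically does not provide the 64-bit-element set intrinsics at all. A sketch of a workaround that builds the value from 32-bit halves; the helper name is illustrative, not part of any library:

    #include <immintrin.h>
    #include <cstdint>

    // Replacement for _mm256_setr_epi64x on toolchains that lack it (e.g. 32-bit MSVC).
    static inline __m256i setr_epi64x_compat(std::int64_t a, std::int64_t b,
                                             std::int64_t c, std::int64_t d) {
        const std::uint64_t ua = (std::uint64_t)a, ub = (std::uint64_t)b;
        const std::uint64_t uc = (std::uint64_t)c, ud = (std::uint64_t)d;
        return _mm256_setr_epi32((int)(std::uint32_t)ua, (int)(ua >> 32),
                                 (int)(std::uint32_t)ub, (int)(ub >> 32),
                                 (int)(std::uint32_t)uc, (int)(uc >> 32),
                                 (int)(std::uint32_t)ud, (int)(ud >> 32));
    }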

Using ymm registers as a “memory-like” storage location

痞子三分冷 submitted on 2019-11-27 06:59:06
Question: Consider the following loop in x86:

    ; on entry, rdi has the number of iterations
    .top:
    ; some magic happens here to calculate a result in rax
    mov [array + rdi * 8], rax  ; store result in output array
    dec rdi
    jnz .top

It's straightforward: something calculates a result in rax (not shown) and then we store the result into an array, in reverse order since we index with rdi. I would like to transform the above loop to not make any writes to memory (we can assume the non-shown calculation doesn't write
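
One direction this can go (a sketch under my own assumptions, not the answer from the thread): if the loop is unrolled so the lane index is a compile-time constant, 64-bit results can be packed into a ymm register with vmovq/vpinsrq/vinserti128 and never touch memory. The function name is illustrative; it requires SSE4.1/AVX2 and a 64-bit build.

    #include <immintrin.h>
    #include <cstdint>

    // Pack four 64-bit results into one ymm without any stores.
    static inline __m256i pack4(std::uint64_t r0, std::uint64_t r1,
                                std::uint64_t r2, std::uint64_t r3) {
        __m128i lo = _mm_cvtsi64_si128((long long)r0);    // vmovq
        lo = _mm_insert_epi64(lo, (long long)r1, 1);      // vpinsrq
        __m128i hi = _mm_cvtsi64_si128((long long)r2);
        hi = _mm_insert_epi64(hi, (long long)r3, 1);
        return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);  // vinserti128
    }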

How to find the horizontal maximum in a 256-bit AVX vector

非 Y 不嫁゛ submitted on 2019-11-27 06:46:26
Question: I have a __m256d vector packed with four 64-bit floating-point values. I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value. My attempts all ended up using a lot of shuffling of the vector elements, making the code neither elegant nor efficient. Also, I found it impossible to stay only in the AVX domain; at some point I had to use SSE 128-bit instructions to extract the final 64-bit value. However, I would like to be proved
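
A typical reduction pattern for this (a sketch, not the accepted answer verbatim): narrow to 128 bits first, then two more max steps. When compiled with AVX enabled, the 128-bit intrinsics below emit VEX-encoded instructions, so there is no SSE/AVX transition penalty even though the code leaves the 256-bit domain.

    #include <immintrin.h>

    // Horizontal max of the four doubles in v.
    static inline double hmax256_pd(__m256d v) {
        __m128d lo = _mm256_castpd256_pd128(v);    // lower two elements (no instruction)
        __m128d hi = _mm256_extractf128_pd(v, 1);  // upper two elements
        __m128d m2 = _mm_max_pd(lo, hi);           // two candidates left
        __m128d sw = _mm_unpackhi_pd(m2, m2);      // move the upper candidate to the low slot
        __m128d m1 = _mm_max_sd(m2, sw);           // final max in the low element
        return _mm_cvtsd_f64(m1);
    }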

Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

混江龙づ霸主 submitted on 2019-11-27 06:42:25
I am computing eight dot products at once with AVX. In my current code I do something like this (before unrolling):

Sandy Bridge / Ivy Bridge:

    __m256 areg0 = _mm256_set1_ps(a[m]);
    for(int i=0; i<n; i++) {
        __m256 breg0 = _mm256_load_ps(&b[8*i]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0, breg0), tmp0);
    }

Haswell:

    __m256 areg0 = _mm256_set1_ps(a[m]);
    for(int i=0; i<n; i++) {
        __m256 breg0 = _mm256_load_ps(&b[8*i]);
        tmp0 = _mm256_fmadd_ps(areg0, breg0, tmp0);
    }

How many times do I need to unroll the loop for each case to ensure maximum throughput? For Haswell using FMA3 I think the answer is here FLOPS
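
The usual reasoning (a sketch under stated assumptions, not a quote from the answers): you need roughly latency × throughput independent accumulators to keep the units busy. On Haswell, with about 5-cycle FMA latency and two FMAs per cycle, that suggests on the order of 8-10 accumulators; the fragment below shows the structure with 4, assuming n is a multiple of 4 and a hypothetical helper name.

    #include <immintrin.h>

    // Multiple independent accumulators hide FMA latency (Haswell case).
    static inline __m256 dot_block(const float *b, int n, __m256 areg0) {
        __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
        __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            acc0 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+0)]), acc0);
            acc1 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+1)]), acc1);
            acc2 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+2)]), acc2);
            acc3 = _mm256_fmadd_ps(areg0, _mm256_load_ps(&b[8*(i+3)]), acc3);
        }
        // fold the partial sums once at the end
        return _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    }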

How are the gather instructions in AVX2 implemented?

岁酱吖の submitted on 2019-11-27 05:36:30
Question: Suppose I'm using AVX2's VGATHERDPS - this should load 8 single-precision floats using 8 DWORD indices. What happens when the data to be loaded lives in different cache lines? Is the instruction implemented as a hardware loop which fetches cache lines one by one, or can it issue loads to multiple cache lines at once? I read a couple of papers which state the former (and that's the one which makes more sense to me), but I would like to know a bit more about this. Link to one paper: http:/
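
For reference, the intrinsic form of VGATHERDPS is _mm256_i32gather_ps; a minimal usage sketch (the function name is illustrative):

    #include <immintrin.h>

    // Load 8 floats from base + idx[k] * 4 for k = 0..7.
    static inline __m256 gather8(const float *base, __m256i idx) {
        return _mm256_i32gather_ps(base, idx, 4);  // scale is 4 bytes per element
    }

As far as published measurements go, Haswell implements the gather as microcode that issues roughly one load per element, so indices spread across many cache lines cost about that many cache accesses; later microarchitectures improve the overhead but still do not fetch a whole vector in one wide access.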

Large (0,1) matrix multiplication using bitwise AND and popcount instead of actual int or float multiplies?

一曲冷凌霜 submitted on 2019-11-27 04:42:33
Question: For multiplying large binary matrices (10Kx20K), what I usually do is convert the matrices to float and perform float matrix multiplication, since integer matrix multiplication is pretty slow (have a look here). This time, though, I'd need to perform over a hundred thousand of these multiplications, and even a millisecond performance improvement on average matters to me. I want an int or float matrix as a result, because the product may have elements that aren't 0 or 1. The input
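
One common building block for this (a sketch under my own assumptions, not the asker's code): pack each row into 64-bit words, AND 256 bits at a time, and accumulate with scalar POPCNT. The names are illustrative, and it assumes AVX2, a POPCNT-capable CPU, 32-byte-aligned inputs, and a word count divisible by 4.

    #include <immintrin.h>
    #include <nmmintrin.h>   // _mm_popcnt_u64
    #include <cstdint>
    #include <cstddef>

    // Dot product of two bit-packed 0/1 rows: sum of popcount(a[i] & b[i]).
    static inline std::uint64_t binary_dot(const std::uint64_t *a,
                                           const std::uint64_t *b,
                                           std::size_t nwords) {
        std::uint64_t sum = 0;
        for (std::size_t i = 0; i < nwords; i += 4) {
            __m256i va = _mm256_load_si256((const __m256i*)&a[i]);
            __m256i vb = _mm256_load_si256((const __m256i*)&b[i]);
            __m256i x  = _mm256_and_si256(va, vb);
            std::uint64_t w[4];
            _mm256_storeu_si256((__m256i*)w, x);   // spill once, popcount the four words
            sum += _mm_popcnt_u64(w[0]) + _mm_popcnt_u64(w[1])
                 + _mm_popcnt_u64(w[2]) + _mm_popcnt_u64(w[3]);
        }
        return sum;
    }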

CPU dispatcher for Visual Studio for AVX and SSE

≯℡__Kan透↙ submitted on 2019-11-27 04:34:05
I work with two computers, one without AVX support and one with AVX. It would be convenient to have my code detect the instruction set supported by my CPU at run time and choose the appropriate code path. I've followed the suggestions by Agner Fog to make a CPU dispatcher ( http://www.agner.org/optimize/#vectorclass ). However, on my machine without AVX, compiling and linking the AVX-enabled code with Visual Studio produces a program that crashes when I run it. For example, I have two source files, one with the SSE2 instruction set defined and some SSE2 instructions, and another one with the AVX
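
A minimal sketch of the runtime test itself (an assumption-laden example, not Agner Fog's dispatcher): the crash described here usually comes from AVX instructions leaking into code that runs before the dispatch, for instance when a whole file is compiled with /arch:AVX, so the AVX path should live in its own translation unit and only be called after a check like this.

    #if defined(_MSC_VER)
    #include <intrin.h>
    static bool cpu_has_avx() {
        int info[4];
        __cpuid(info, 1);
        const bool osxsave = (info[2] & (1 << 27)) != 0;  // OS uses XSAVE/XRSTOR
        const bool avx     = (info[2] & (1 << 28)) != 0;  // CPU reports AVX
        if (!osxsave || !avx) return false;
        const unsigned long long xcr0 = _xgetbv(0);
        return (xcr0 & 0x6) == 0x6;                       // XMM and YMM state enabled by the OS
    }
    #else
    static bool cpu_has_avx() {
        return __builtin_cpu_supports("avx") != 0;        // GCC/Clang helper
    }
    #endif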