avx

Shifting 4 integers right by different values SIMD

Submitted by 半城伤御伤魂 on 2019-12-18 06:50:14

Question: SSE does not provide a way of shifting packed integers by a variable amount (I can use any instructions, AVX and older). You can only do uniform shifts. The result I'm trying to achieve for each integer in the vector is this: i[0] = i[0] & 0b111111; i[1] = (i[1]>>6) & 0b111111; i[2] = (i[2]>>12) & 0b111111; i[3] = (i[3]>>18) & 0b111111; Essentially I am trying to isolate a different group of 6 bits in each integer. So what is the optimal solution? Things I thought about: You can simulate a variable
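One pre-AVX2 workaround for exactly this access pattern is to multiply each lane by a power of two to left-align its 6-bit field, then do a single uniform right shift and mask. A minimal sketch, assuming SSE4.1's `_mm_mullo_epi32` is available (the function names here are mine, and a scalar reference is included for CPUs without SSE4.1):

```c
#include <smmintrin.h>  /* SSE4.1 */
#include <stdint.h>

/* Scalar reference: out[k] = (in[k] >> 6*k) & 0x3F */
static void extract_fields_ref(const uint32_t in[4], uint32_t out[4]) {
    for (int k = 0; k < 4; ++k)
        out[k] = (in[k] >> (6 * k)) & 0x3F;
}

__attribute__((target("sse4.1")))
static void extract_fields_sse41(const uint32_t in[4], uint32_t out[4]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* Multiply by 2^(18 - shift) so every field lands at bits 18..23. */
    __m128i mul = _mm_setr_epi32(1 << 18, 1 << 12, 1 << 6, 1);
    v = _mm_mullo_epi32(v, mul);   /* wrap-around of the high bits is harmless */
    v = _mm_srli_epi32(v, 18);     /* one uniform shift for all lanes */
    v = _mm_and_si128(v, _mm_set1_epi32(0x3F));
    _mm_storeu_si128((__m128i *)out, v);
}

/* Dispatch so the sketch still runs on CPUs without SSE4.1. */
static void extract_fields(const uint32_t in[4], uint32_t out[4]) {
    if (__builtin_cpu_supports("sse4.1"))
        extract_fields_sse41(in, out);
    else
        extract_fields_ref(in, out);
}
```

The multiply-low discards any bits pushed past bit 31, which is fine here since only bits 18..23 are kept after the shift and mask.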

The indices of non-zero bytes of an SSE/AVX register

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-17 20:45:59

Question: If an SSE/AVX register's value is such that all its bytes are either 0 or 1, is there any way to efficiently get the indices of all non-zero elements? For example, if the xmm value is | r0=0 | r1=1 | r2=0 | r3=1 | r4=0 | r5=1 | r6=0 |...| r14=0 | r15=1 | the result should be something like (1, 3, 5, ..., 15). The result should be placed in another __m128i variable or a char[16] array. If it helps, we can assume the register's value is such that all bytes are either 0 or some constant non-zero value
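A common approach is a compare against zero plus `_mm_movemask_epi8`, then peeling the indices out of the resulting 16-bit mask with a count-trailing-zeros loop. A sketch (the function name is mine; SSE2 is baseline on x86-64, so no feature test is needed):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Writes the indices of the non-zero bytes to out[]; returns the count. */
static int nonzero_indices(const uint8_t bytes[16], uint8_t out[16]) {
    __m128i v = _mm_loadu_si128((const __m128i *)bytes);
    /* cmpeq against zero marks the zero bytes; invert so bit i of the
       mask is set exactly where bytes[i] != 0. */
    int mask = ~_mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()))
               & 0xFFFF;
    int n = 0;
    while (mask) {
        out[n++] = (uint8_t)__builtin_ctz(mask); /* lowest set bit = index */
        mask &= mask - 1;                        /* clear that bit */
    }
    return n;
}
```

The loop costs one iteration per non-zero byte; fully branchless variants typically go through a `pshufb` lookup table or BMI2 `pext` on the mask instead.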

developing for new instruction sets

Submitted by 青春壹個敷衍的年華 on 2019-12-17 19:25:38

Question: Intel is set to release a new instruction set called AVX, which includes an extension of SSE to 256-bit operation. That is, either 4 double-precision elements or 8 single-precision elements. How would one go about developing code for AVX, considering there's no hardware out there that supports it yet? More generally, how can developers write code for hardware that doesn't exist, for instance if they want to have software ready when the supporting CPU is released? Answer 1: Maybe I'm missing

How to quickly count bits into separate bins in a series of ints on Sandy Bridge? [duplicate]

Submitted by 一世执手 on 2019-12-17 19:18:53

Question: This question already has answers here: Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2 (5 answers). Closed last month. Update: Please read the code, it is NOT about counting bits in one int. Is it possible to improve performance of the following code with some clever assembler? uint bit_counter[64]; void Count(uint64 bits) { bit_counter[0] += (bits >> 0) & 1; bit_counter[1] += (bits >> 1) & 1; // .. bit_counter[63] += (bits >> 63) & 1; } Count is in the
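For sparse masks, one commonly suggested alternative (a sketch, not the asker's code) is to touch only the counters whose bits are actually set, using count-trailing-zeros instead of 64 unconditional adds:

```c
#include <stdint.h>

/* Increment one bin per set bit; cost scales with popcount(bits), not 64. */
static void count_set_bits(uint64_t bits, uint32_t counter[64]) {
    while (bits) {
        counter[__builtin_ctzll(bits)]++; /* index of the lowest set bit */
        bits &= bits - 1;                 /* clear that bit */
    }
}
```

For dense masks the data-dependent loop loses to the straight-line version, and the SIMD answers in the linked duplicate go further by accumulating many masks into narrow vertical counters before widening.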

Do 128bit cross lane operations in AVX512 give better performance?

Submitted by 一曲冷凌霜 on 2019-12-17 18:29:34

Question: In designing forward-looking algorithms for AVX256, AVX512, and one day AVX1024, and considering the potential implementation complexity/cost of fully generic permutes at large SIMD widths, I wondered if it is better to generally keep to isolated 128-bit operations even within AVX512? Especially given that AVX had 128-bit units to execute 256-bit operations. To that end I wanted to know if there was a performance difference between AVX512 permute-type operations across all of the 512-bit vector as

Fastest way to do horizontal vector sum with AVX instructions [duplicate]

Submitted by 半世苍凉 on 2019-12-17 10:58:56

Question: This question already has answers here: Get sum of values stored in __m256d with SSE/AVX (2 answers). Closed 11 months ago. I have a packed vector of four 64-bit floating-point values. I would like to get the sum of the vector's elements. With SSE (and using 32-bit floats) I could just do the following: v_sum = _mm_hadd_ps(v_sum, v_sum); v_sum = _mm_hadd_ps(v_sum, v_sum); Unfortunately, even though AVX features a _mm256_hadd_pd instruction, its result differs from the SSE version. I
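The usual pattern for `__m256d` is to extract the high 128-bit half, add it to the low half, then finish with one shuffle and add instead of `hadd` (which operates within 128-bit lanes). A sketch with my own function names, guarded so it falls back to scalar on CPUs without AVX:

```c
#include <immintrin.h>

__attribute__((target("avx")))
static double hsum256_pd_avx(double a, double b, double c, double d) {
    __m256d v  = _mm256_setr_pd(a, b, c, d);
    __m128d lo = _mm256_castpd256_pd128(v);     /* (a, b) */
    __m128d hi = _mm256_extractf128_pd(v, 1);   /* (c, d) */
    lo = _mm_add_pd(lo, hi);                    /* (a+c, b+d) */
    __m128d high = _mm_unpackhi_pd(lo, lo);     /* (b+d, b+d) */
    return _mm_cvtsd_f64(_mm_add_sd(lo, high)); /* (a+c) + (b+d) */
}

static double hsum4(double a, double b, double c, double d) {
    if (__builtin_cpu_supports("avx"))
        return hsum256_pd_avx(a, b, c, d);
    return a + b + c + d; /* scalar fallback */
}
```

Only one cross-lane operation (the extract) is needed; the rest stays in 128-bit registers.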

Fastest way to multiply an array of int64_t?

Submitted by 孤者浪人 on 2019-12-17 09:37:10

Question: I want to vectorize the multiplication of two memory-aligned arrays. I didn't find any way to multiply 64×64 bits in AVX/AVX2, so I just did loop unrolling and AVX2 loads/stores. Is there a faster way to do this? Note: I don't want to save the high half of each multiplication result. void multiply_vex(long *Gi_vec, long q, long *Gj_vec){ int i; __m256i data_j, data_i; __uint64_t *ptr_J = (__uint64_t*)&data_j; __uint64_t *ptr_I = (__uint64_t*)&data_i; for (i=0; i<BASE_VEX_STOP; i+=4) { data_i =
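The usual AVX2 emulation builds the low 64 bits of each product from 32-bit pieces with `_mm256_mul_epu32` plus shifts and adds. The decomposition it relies on, shown here in scalar form (the function name is mine):

```c
#include <stdint.h>

/* (aH*2^32 + aL) * (bH*2^32 + bL) mod 2^64
 *   = aL*bL + ((aL*bH + aH*bL) << 32)
 * The aH*bH term is shifted out entirely, which matches the asker's note
 * that the high half of each product is not needed. */
static uint64_t mul64_lo(uint64_t a, uint64_t b) {
    uint64_t aL = (uint32_t)a, aH = a >> 32;
    uint64_t bL = (uint32_t)b, bH = b >> 32;
    return aL * bL + ((aL * bH + aH * bL) << 32);
}
```

In vector form this costs three `vpmuludq`-style multiplies per four elements, so whether it beats unrolled scalar `imul` depends on the surrounding code.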

How to efficiently perform double/int64 conversions with SSE/AVX?

Submitted by 与世无争的帅哥 on 2019-12-17 05:05:30

Question: SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers: _mm_cvtps_epi32() _mm_cvtepi32_ps() But there are no equivalents for double-precision and 64-bit integers. In other words, the following are missing: _mm_cvtpd_epi64() _mm_cvtepi64_pd() It seems that AVX doesn't have them either. What is the most efficient way to simulate these intrinsics? Answer 1: There's no single instruction until AVX512, which added conversion to/from 64-bit integers, signed or unsigned.
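The standard pre-AVX512 trick for int64→double, valid for magnitudes below 2^51, uses the double 2^52 + 2^51 as a magic constant. A scalar sketch of the bit manipulation (the vector versions do the same thing with `_mm256_add_epi64` and `_mm256_sub_pd`; the function name is mine):

```c
#include <stdint.h>
#include <string.h>

/* Valid for |v| < 2^51. 0x4338000000000000 is the bit pattern of
 * 2^52 + 2^51 = 6755399441055744.0. Adding v with *integer* arithmetic
 * embeds it in the mantissa of that double, and one floating-point
 * subtract then recovers v exactly as a double. */
static double i64_to_f64(int64_t v) {
    uint64_t bits = (uint64_t)v + 0x4338000000000000ULL;
    double d;
    memcpy(&d, &bits, sizeof d);  /* reinterpret, no conversion */
    return d - 6755399441055744.0;
}
```

The 2^51 part of the constant is what makes negative inputs work: the integer add borrows within the mantissa without disturbing the exponent field.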

What's missing/sub-optimal in this memcpy implementation?

Submitted by 房东的猫 on 2019-12-17 04:21:32

Question: I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise of what I did and didn't think about, but here's some guy's implementation: __forceinline // Since Size is usually known, // most useless code will be optimized out // if the function is inlined. void* myMemcpy(char* Dst, const char* Src, size_t Size) { void* start = Dst; for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) ) { __m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++)
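One immediately visible gap in that snippet is the tail: the loop only copies whole 32-byte blocks, and the `((const __m256i* &)Src)++` reference-cast is an MSVC-ism. A cleaned-up sketch of the same idea with plain pointer arithmetic, a byte tail, and a dispatch so it still runs without AVX (names are mine):

```c
#include <immintrin.h>
#include <stddef.h>
#include <string.h>

__attribute__((target("avx")))
static void *my_memcpy_avx(void *dst, const void *src, size_t n) {
    char *d = dst;
    const char *s = src;
    for (; n >= sizeof(__m256i); n -= sizeof(__m256i)) {
        __m256i ymm = _mm256_loadu_si256((const __m256i *)s);
        _mm256_storeu_si256((__m256i *)d, ymm);
        s += sizeof(__m256i);
        d += sizeof(__m256i);
    }
    while (n--)              /* bytes the vector loop leaves behind */
        *d++ = *s++;
    return dst;
}

static void *my_memcpy(void *dst, const void *src, size_t n) {
    if (__builtin_cpu_supports("avx"))
        return my_memcpy_avx(dst, src, n);
    return memcpy(dst, src, n); /* library fallback */
}
```

A production memcpy would additionally handle overlapping head/tail vectors, alignment, and non-temporal stores for very large sizes.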

How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?

Submitted by ⅰ亾dé卋堺 on 2019-12-17 04:00:12

Question: The intrinsic int mask = _mm256_movemask_epi8(__m256i s1) creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2, for example) I would like to perform the inverse of _mm256_movemask_epi8, i.e., create a __m256i vector with the most significant bit of each byte containing the corresponding bit of the uint32_t mask. What is the best way to do this? Edit: I need to perform the inverse because the
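With AVX2, the widely used answer broadcasts the mask, uses `_mm256_shuffle_epi8` to place the right mask byte into every result byte, then isolates one bit per byte and compares. A sketch (function names are mine; guarded so it only runs where AVX2 exists):

```c
#include <immintrin.h>
#include <stdint.h>

__attribute__((target("avx2")))
static __m256i inverse_movemask_epi8(uint32_t mask) {
    __m256i v = _mm256_set1_epi32((int)mask);  /* mask in every dword */
    /* Byte i of the result must see the mask byte that holds bit i;
       shuffle indices are lane-local, which works because the mask is
       repeated in both 128-bit lanes. */
    __m256i shuf = _mm256_setr_epi8(
        0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3);
    v = _mm256_shuffle_epi8(v, shuf);
    /* Select bit (i % 8) in each byte; cmpeq yields 0xFF where it was
       set, so every byte's MSB carries the corresponding mask bit. */
    __m256i bit = _mm256_set1_epi64x((int64_t)0x8040201008040201ULL);
    v = _mm256_and_si256(v, bit);
    return _mm256_cmpeq_epi8(v, bit);
}

/* Round-trip check: movemask(inverse(m)) should return m. */
__attribute__((target("avx2")))
static uint32_t movemask_roundtrip_avx2(uint32_t m) {
    return (uint32_t)_mm256_movemask_epi8(inverse_movemask_epi8(m));
}

static uint32_t movemask_roundtrip(uint32_t m) {
    if (__builtin_cpu_supports("avx2"))
        return movemask_roundtrip_avx2(m);
    return m; /* nothing to verify without AVX2 */
}
```

This yields 0x00/0xFF bytes rather than just the MSB, which is usually what the mask is wanted for (blends, further logic) anyway.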