avx2

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

Submitted by 余生颓废 on 2020-03-12 05:14:04
Question: I am looking for an optimal method to calculate the sum of all packed 32-bit integers in a __m256i or __m512i. To calculate the sum of n elements, I often use log2(n) vpaddd and vpermd instructions, then extract the final result. However, I don't think that is the best option. Edit: best/optimal in terms of speed/cycle count.
Answer 1: (Related: if you're looking for the non-existent _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics; vpsadbw as an hsum within qwords is much …
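A minimal sketch of the usual reduce-to-128-bits-first approach for the AVX2 case (the helper name hsum_epi32_avx2 is mine, not from the original post): extract the high 128-bit lane, add it to the low lane, then keep halving within the XMM register with shuffles.

    #include <immintrin.h>

    // Horizontal sum of the eight packed 32-bit integers in a __m256i.
    static inline int hsum_epi32_avx2(__m256i v)
    {
        __m128i lo     = _mm256_castsi256_si128(v);        // low 128 bits (no instruction)
        __m128i hi     = _mm256_extracti128_si256(v, 1);   // high 128 bits
        __m128i sum128 = _mm_add_epi32(lo, hi);            // 4 partial sums
        __m128i hi64   = _mm_unpackhi_epi64(sum128, sum128);
        __m128i sum64  = _mm_add_epi32(sum128, hi64);      // 2 partial sums
        __m128i hi32   = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));
        __m128i sum32  = _mm_add_epi32(sum64, hi32);       // final sum in element 0
        return _mm_cvtsi128_si32(sum32);
    }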

_mm_alignr_epi8 (PALIGNR) equivalent in AVX2

Submitted by 安稳与你 on 2020-01-22 19:49:12
Question: In SSSE3, the PALIGNR instruction performs the following: PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result into the destination. I'm currently in the midst of porting my SSE4 code to use AVX2 instructions, working on 256-bit registers instead of 128-bit. Naively, I believed that the …
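For reference, _mm256_alignr_epi8 (VPALIGNR) in AVX2 operates on each 128-bit lane separately, so a naive port does not behave like one 256-bit PALIGNR. A sketch of one common workaround for a byte shift of 0 < N < 16, using a cross-lane permute first (the helper name alignr256 is mine, not from the original post):

    #include <immintrin.h>

    // 256-bit analogue of PALIGNR for 0 < N < 16: returns the low 32 bytes
    // of the 64-byte concatenation (a:b) shifted right by N bytes.
    template <int N>
    static inline __m256i alignr256(__m256i a, __m256i b)
    {
        // t holds b's high lane in its low lane and a's low lane in its high lane:
        // exactly the lane each output lane needs to borrow bytes from.
        __m256i t = _mm256_permute2x128_si256(b, a, 0x21);
        return _mm256_alignr_epi8(t, b, N);   // per-lane VPALIGNR finishes the job
    }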

Disable AVX2 functions on non-Haswell processors

Submitted by 狂风中的少年 on 2020-01-22 12:59:50
Question: I have written some AVX2 code to run on a Haswell i7 processor. The same codebase is also used on non-Haswell processors, where the same code should be replaced with its SSE equivalent. I was wondering whether there is a way for the compiler to ignore AVX2 instructions on non-Haswell processors. I need something like:

    public void useSSEorAVX(...) {
        IF (compiler directive detects AVX2)
            AVX2 code (this part is ready)
        ELSE
            SSE code (this part is also ready)
    }

Right now I am commenting out related …
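A minimal sketch of the runtime-dispatch pattern this usually boils down to (a run-time check rather than a compile-time directive). __builtin_cpu_supports is GCC/Clang-specific; on MSVC you would query __cpuid instead. The function names are placeholders, and the AVX2 version must live in a translation unit compiled with AVX2 enabled so the compiler never emits AVX2 instructions on the common path.

    void process_avx2(float* data, int n);   // built with -mavx2 (or a target attribute)
    void process_sse(float* data, int n);    // baseline SSE build

    // Pick the AVX2 path only when the CPU actually supports it.
    void useSSEorAVX(float* data, int n)
    {
        if (__builtin_cpu_supports("avx2"))
            process_avx2(data, n);
        else
            process_sse(data, n);
    }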

Find the first instance of a character using simd

Submitted by £可爱£侵袭症+ on 2020-01-10 02:59:06
Question: I am trying to find the first instance of a character, in this case '"', using SIMD (AVX2 or earlier). I'd like to use _mm256_cmpeq_epi8, but then I need a quick way of finding whether any of the resulting bytes in the __m256i have been set to 0xFF. The plan was then to use _mm256_movemask_epi8 to convert the result from bytes to bits, and then to use ffs to get a matching index. Is it better to move out a portion at a time using _mm_movemask_epi8? Any other suggestions?
Answer 1: You have the right idea …
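A sketch of exactly that plan for one 32-byte block, assuming at least 32 readable bytes at p (tail handling is omitted); __builtin_ctz plays the role of ffs here and returns the index of the lowest set bit.

    #include <immintrin.h>
    #include <cstdint>

    // Index of the first '"' in the 32 bytes at p, or -1 if there is none.
    static inline int find_quote32(const char* p)
    {
        __m256i chunk  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
        __m256i needle = _mm256_set1_epi8('"');
        __m256i eq     = _mm256_cmpeq_epi8(chunk, needle);        // 0xFF where bytes match
        uint32_t mask  = (uint32_t)_mm256_movemask_epi8(eq);      // one bit per byte
        return mask ? __builtin_ctz(mask) : -1;
    }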

What's the fastest stride-3 gather instruction sequence?

Submitted by 。_饼干妹妹 on 2020-01-09 10:02:17
Question: What is the most efficient sequence to generate a stride-3 gather of 32-bit elements from memory? If the memory is arranged as:

    MEM = R0 G0 B0 R1 G1 B1 R2 G2 B2 R3 G3 B3 ...

we want to obtain three YMM registers where:

    YMM0 = R0 R1 R2 R3 R4 R5 R6 R7
    YMM1 = G0 G1 G2 G3 G4 G5 G6 G7
    YMM2 = B0 B1 B2 B3 B4 B5 B6 B7

Motivation and discussion: the scalar C code is something like template <typename T> T Process(const T* Input) { T Result = 0; for (int i = 0; i < 4096; ++i) { T R = Input[3 …
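One straightforward baseline, not claimed to be the fastest (beating it is the point of the question), uses AVX2 gather instructions with a stride-3 index vector; shuffle-based de-interleaving of whole vector loads is usually faster on Haswell, where gathers are slow.

    #include <immintrin.h>

    // Gather R0..R7, G0..G7, B0..B7 from interleaved 32-bit RGB data.
    static inline void load_rgb_stride3(const int* mem, __m256i& r, __m256i& g, __m256i& b)
    {
        const __m256i idx = _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21);
        r = _mm256_i32gather_epi32(mem,     idx, 4);   // R0 R1 ... R7
        g = _mm256_i32gather_epi32(mem + 1, idx, 4);   // G0 G1 ... G7
        b = _mm256_i32gather_epi32(mem + 2, idx, 4);   // B0 B1 ... B7
    }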

GCC couldn't vectorize 64-bit multiplication. Can 64-bit x 64-bit -> 128-bit widening multiplication be vectorized on AVX2?

Submitted by 感情迁移 on 2020-01-02 05:17:28
Question: I am trying to vectorize a CBRNG which uses a 64-bit widening multiplication:

    static __inline__ uint64_t mulhilo64(uint64_t a, uint64_t b, uint64_t* hip) {
        __uint128_t product = ((__uint128_t)a) * ((__uint128_t)b);
        *hip = product >> 64;
        return (uint64_t)product;
    }

Does such a multiplication exist in a vectorized form in AVX2?
Answer 1: No. There's no 64 x 64 -> 128-bit arithmetic as a vector instruction. Nor is there a vector mulhi-type instruction (high-half result of a multiply). [V]PMULUDQ can do 32 x 32 -> 64 …
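As a point of reference for what the 32 x 32 -> 64 building block can do, here is a sketch of the low 64 bits of a 64 x 64 product per element (an emulation of _mm256_mullo_epi64, which AVX2 itself lacks). The high half would additionally need the carry out of the lo*lo term and the hi*hi term, which is why a full 128-bit result is awkward to vectorize.

    #include <immintrin.h>

    // Low 64 bits of a 64x64-bit multiply per element, from VPMULUDQ partial products.
    static inline __m256i mullo_epi64_avx2(__m256i a, __m256i b)
    {
        __m256i a_hi  = _mm256_srli_epi64(a, 32);          // high 32 bits of each a
        __m256i b_hi  = _mm256_srli_epi64(b, 32);          // high 32 bits of each b
        __m256i lo_lo = _mm256_mul_epu32(a, b);            // a_lo * b_lo (full 64 bits)
        __m256i hi_lo = _mm256_mul_epu32(a_hi, b);         // a_hi * b_lo
        __m256i lo_hi = _mm256_mul_epu32(a, b_hi);         // a_lo * b_hi
        __m256i cross = _mm256_add_epi64(hi_lo, lo_hi);    // sum of the cross terms
        return _mm256_add_epi64(lo_lo, _mm256_slli_epi64(cross, 32));
    }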

How to clear the upper 128 bits of __m256 value?

Submitted by 旧街凉风 on 2020-01-02 01:07:12
Question: How can I clear the upper 128 bits of m2?

    __m256i m2 = _mm256_set1_epi32(2);
    __m128i m1 = _mm_set1_epi32(1);

    m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
    m2 = _mm256_castsi128_si256(m1);

Neither works: Intel's documentation for the _mm256_castsi128_si256 intrinsic says that "the upper bits of the resulting vector are undefined". At the same time I can easily do it in assembly:

    VMOVDQA xmm2, xmm2 // zeros upper ymm2
    VMOVDQA xmm2, xmm1

Of course I'd rather not use "and" or _mm256 …
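A sketch of the intrinsic-level options, assuming a toolchain recent enough to provide _mm256_zextsi128_si256 (availability varies by compiler version, so treat that as an assumption); the fallback spends a real instruction inserting into an explicit zero vector.

    #include <immintrin.h>

    // Zero-extend a __m128i into a __m256i with guaranteed-zero upper 128 bits.
    static inline __m256i zext128(__m128i lo)
    {
        return _mm256_zextsi128_si256(lo);   // documents the zero-extension the plain cast leaves undefined
    }

    // Fallback for older compilers: insert into the low lane of a zeroed vector.
    static inline __m256i zext128_fallback(__m128i lo)
    {
        return _mm256_inserti128_si256(_mm256_setzero_si256(), lo, 0);
    }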

Why do processors with only AVX out-perform AVX2 processors for many SIMD algorithms?

Submitted by 大憨熊 on 2020-01-01 11:33:51
Question: I've been investigating the benefits of SIMD algorithms in C# and C++, and found that in many cases using 128-bit registers on an AVX processor offers a better improvement than using 256-bit registers on a processor with AVX2, but I don't understand why. By improvement I mean the speed-up of a SIMD algorithm relative to a non-SIMD algorithm on the same machine.
Answer 1: On an AVX processor, the upper half of the 256-bit registers and floating-point units are powered down by the CPU when not …