simd

How to add an AVX2 vector horizontally 3 by 3?

依然范特西╮ submitted on 2019-12-14 04:25:16
Question: I have a __m256i vector containing 16 x 16-bit elements. I want to apply a horizontal addition to each group of three adjacent elements. In scalar mode I use the following code:

```c
unsigned short int temp[16];
__m256i sum_v; // has some values; a vector of 16 16-bit elements:
               // | 0 | x15 | x14 | x13 | ... | x3 | x2 | x1 |
_mm256_store_si256((__m256i *)&temp[0], sum_v);
output1 = (temp[0] + temp[1] + temp[2]);
output2 = (temp[3] + temp[4] + temp[5]);
output3 = (temp[6] + temp[7] + temp[8]);
output4 = (temp[9] + temp[10] + …
```
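Since sum_v is already being spilled to memory, one hedged SIMD option is three overlapping unaligned loads, which produce all sliding 3-element sums at once; the wanted group sums then sit in lanes 0, 3, 6, 9 and 12. A sketch (mine, not from the question; note it reads two elements past the 16):

```c
#include <immintrin.h>

// Lane i of the result is x[i] + x[i+1] + x[i+2]; the buffer must have
// at least 18 readable 16-bit elements (e.g. pad temp[] with two zeros).
static inline __m256i sum3_sliding(const unsigned short *x)
{
    __m256i v0 = _mm256_loadu_si256((const __m256i *)(x + 0));
    __m256i v1 = _mm256_loadu_si256((const __m256i *)(x + 1));
    __m256i v2 = _mm256_loadu_si256((const __m256i *)(x + 2));
    return _mm256_add_epi16(_mm256_add_epi16(v0, v1), v2);
}
```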

Why does the FMA _mm256_fmadd_pd() intrinsic have 3 asm mnemonics, “vfmadd132pd”, “231” and “213”?

醉酒当歌 submitted on 2019-12-14 03:40:04
Question: Could someone explain to me why there are 3 variants of the fused multiply-accumulate instruction, vfmadd132pd, vfmadd231pd and vfmadd213pd, while there is only one C intrinsic, _mm256_fmadd_pd? To make things simple, what is the difference between (in AT&T syntax)

```
vfmadd132pd %ymm0, %ymm1, %ymm2
vfmadd231pd %ymm0, %ymm1, %ymm2
vfmadd213pd %ymm0, %ymm1, %ymm2
```

I did not get any idea from Intel's intrinsics guide. I ask because I see all of them in the assembler output of a chunk of C code I …
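For context (my summary, not from the excerpt): the three digits name which operands, numbered 1 to 3 in Intel-syntax order with operand 1 as the destination, supply the two multiplicands and the addend. All three compute a*b + c, which is why one intrinsic suffices; the compiler picks whichever encoding lets it overwrite a register whose old value is dead.

```c
/* Intel syntax (operand 1 is the destination):
 *   vfmadd132pd ymm1, ymm2, ymm3   =>  ymm1 = ymm1*ymm3 + ymm2
 *   vfmadd213pd ymm1, ymm2, ymm3   =>  ymm1 = ymm2*ymm1 + ymm3
 *   vfmadd231pd ymm1, ymm2, ymm3   =>  ymm1 = ymm2*ymm3 + ymm1
 *
 * AT&T syntax reverses the operand list, so the question's
 *   vfmadd231pd %ymm0, %ymm1, %ymm2
 * computes ymm2 = ymm1*ymm0 + ymm2.
 */
```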

How to use vindex and scale with _mm_i32gather_epi32 to gather elements? [duplicate]

孤者浪人 submitted on 2019-12-14 03:29:51
Question: This question already has answers here: Load address calculation when using AVX2 gather instructions (3 answers). Closed last year.

Intel's Intrinsics Guide says:

```c
__m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)
```

And:

Description: Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are …
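In scalar terms, the effective address of lane i is base_addr plus vindex[i] times scale, where scale must be a compile-time constant of 1, 2, 4 or 8. A minimal sketch (function name and table are mine, for illustration):

```c
#include <immintrin.h>

// Gathers table[0], table[2], table[4], table[6]: the byte address of lane i
// is (char *)table + vindex[i] * scale, so with 32-bit ints, scale = 4.
static inline __m128i gather_even(const int *table)
{
    __m128i vindex = _mm_set_epi32(6, 4, 2, 0); // lanes 0..3 hold 0, 2, 4, 6
    return _mm_i32gather_epi32(table, vindex, 4);
}
```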

Strange compiler behavior when inlining (ASM code included)

大城市里の小女人 submitted on 2019-12-14 03:15:42
Question: My problem is that the compiler chooses not to inline a function in a specific case, making the code a LOT slower. The function is supposed to compute the dot product of a vector (SIMD-accelerated). I have it written in two different styles: in the first, the Vector class aggregates a __m128 member; in the second, the Vector is just a typedef of the __m128 member. In case 1 I get code that is 2 times slower; the function doesn't inline. In case 2 I get optimal code, very fast, inlined. In case 1 the Vector and the Dot …
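A hedged reconstruction of the two styles being compared (the names and the horizontal-sum code are mine, not the poster's):

```cpp
#include <xmmintrin.h>

// Style 1: a class that aggregates a __m128 member.
struct Vector { __m128 data; };

static inline float Dot(Vector a, Vector b) {
    __m128 m  = _mm_mul_ps(a.data, b.data);
    __m128 sh = _mm_add_ps(m, _mm_movehl_ps(m, m));   // lanes 0,1 = x+z, y+w
    sh = _mm_add_ss(sh, _mm_shuffle_ps(sh, sh, 1));   // (x+z) + (y+w)
    return _mm_cvtss_f32(sh);
}

// Style 2: the vector type is just an alias of __m128; Dot then takes
// __m128 arguments directly, which is the version that reportedly inlined.
typedef __m128 Vector2;
```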

Border check in image processing

别来无恙 submitted on 2019-12-14 03:14:40
Question: I want to take care of the border conditions when applying any filter in image processing. I am extrapolating the border to create a new boundary. For example, with a 4x3 input:

```c
// Input
int image[4][3] =
    1  2  3  4
    2  4  6  8
    3  6  9 12

// Output
int extensionimage[6][5] =
    1  1  2  3  4  4
    1  1  2  3  4  4
    2  2  4  6  8  8
    3  3  6  9 12 12
    3  3  6  9 12 12
```

My code:

```c
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void padd_border(int *img, int *extension, int width, int height);

int main(){
    int width = 4 …
```
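A minimal sketch of the replicate ("clamp to edge") padding being described, written from scratch since the excerpt cuts off (the function name is mine):

```c
// Every output pixel reads the nearest input pixel, so the one-pixel
// border duplicates the edge rows and columns. Row-major storage assumed.
static void pad_replicate(const int *img, int *ext, int width, int height)
{
    for (int y = 0; y < height + 2; ++y) {
        int sy = y - 1;
        if (sy < 0) sy = 0;
        if (sy > height - 1) sy = height - 1;
        for (int x = 0; x < width + 2; ++x) {
            int sx = x - 1;
            if (sx < 0) sx = 0;
            if (sx > width - 1) sx = width - 1;
            ext[y * (width + 2) + x] = img[sy * width + sx];
        }
    }
}
```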

OpenCV 4.0 Android (NDK) With NEON: Why And Why Not?

戏子无情 submitted on 2019-12-13 18:42:14
Question: I am using the OpenCV for Android NDK, which can be downloaded from here, and I just use libopencv_java4.so. It seems that it is not compiled with NEON (correct me if I am wrong). However, IMHO, NEON, the SIMD architecture, could dramatically speed up the library. On the other hand, I think that if the OpenCV people decided to compile without NEON, there must be a good reason. Therefore, I am hoping for advice on the following: Shall I compile with NEON? Will NEON boost the speed (I think yes)? …

How to vectorize a[i] = a[i-1] + c with AVX2

柔情痞子 submitted on 2019-12-13 17:37:59
Question: I want to vectorize a[i] = a[i-1] + c with AVX2 instructions. It seems unvectorizable because of the loop-carried dependency. I've vectorized it anyway and want to share my answer here, to see whether there is a better answer to this question or whether my solution is good.

Answer 1: I have implemented the following function to vectorize this and it seems OK! The speedup is 2.5x over gcc -O3. Here is the solution:

```c
// vectorized
inline void vec(int a[LEN], int b, int c) { // b=1 and c=2 in this case
    int i = 0;
    a[i++] = b; // 0 …
```
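The key observation behind any vectorization here is the closed form a[i] = b + i*c, which removes the loop-carried dependency entirely. A sketch along those lines (mine, not the truncated answer's code):

```c
#include <immintrin.h>

// a[i] = a[i-1] + c with a[0] = b  is equivalent to  a[i] = b + i*c.
static void fill_affine(int *a, int n, int b, int c)
{
    __m256i v = _mm256_add_epi32(
        _mm256_set1_epi32(b),
        _mm256_mullo_epi32(_mm256_set1_epi32(c),
                           _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7)));
    __m256i step = _mm256_set1_epi32(8 * c); // advance 8 indices per iteration
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        _mm256_storeu_si256((__m256i *)(a + i), v);
        v = _mm256_add_epi32(v, step);
    }
    for (; i < n; ++i) // scalar tail
        a[i] = b + i * c;
}
```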

What is the availability of 'vector long long'?

China☆狼群 submitted on 2019-12-13 16:50:50
Question: I'm testing on an old PowerMac G5, which is a Power4 machine. The build is failing:

```
$ make
...
g++ -DNDEBUG -g2 -O3 -mcpu=power4 -maltivec -c ppc-simd.cpp
ppc-crypto.h:36: error: use of 'long long' in AltiVec types is invalid
make: *** [ppc-simd.o] Error 1
```

The failure is due to:

```c
typedef __vector unsigned long long uint64x2_p8;
```

I'm having trouble determining when I should make the typedef available. With -mcpu=power4 -maltivec the machine reports 64-bit availability:

```
$ gcc -mcpu=power4 …
```
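'vector long long' needs VSX (POWER7) or POWER8 vector support rather than plain AltiVec, so one option is to key the typedef off the corresponding predefined macros. A hedged sketch (verify the macro names against your toolchain):

```c
// GCC defines __VSX__ when -mvsx is in effect and _ARCH_PWR8 when targeting
// POWER8; -mcpu=power4 -maltivec provides neither, so the typedef stays hidden.
#if defined(__VSX__) || defined(_ARCH_PWR8)
typedef __vector unsigned long long uint64x2_p8;
#endif
```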

Intel's pragma simd vs OpenMP's pragma omp simd

China☆狼群 submitted on 2019-12-13 14:17:00
Question: The Intel compiler allows us to vectorize loops via

```c
#pragma simd
for ( ... )
```

However, you also have the option to do this with OpenMP 4's directive:

```c
#pragma omp simd
for ( ... )
```

Is there any difference between the two?

Answer 1: For all intents and purposes they should be identical. The difference is that the OpenMP 4.0 #pragma omp simd directive is portable and should work with other compilers that support OpenMP 4.0, as well as Intel's. Furthermore, there are several clauses in the OpenMP version …
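As an illustration of those clauses (my example, not from the answer), a reduction looks like this and builds with any OpenMP-4-aware compiler, e.g. gcc -fopenmp-simd or icc -qopenmp-simd:

```c
// The reduction clause tells the vectorizer that sum may be accumulated
// in vector partial sums and combined at the end.
float dot(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```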

SSE alpha blending for pre-multiplied ARGB

两盒软妹~` submitted on 2019-12-13 13:39:45
Question: I'm trying to write an SSE-enabled alpha compositor, and this is what I've come up with. First, the code to blend two vectors of 4 pixels each:

```c
// alpha blend two 128-bit (16 byte) SSE vectors containing 4 pre-multiplied ARGB values each
//
__attribute__((always_inline)) static inline __m128i blend4(__m128i under, __m128i over) {
    // shuffle masks for alpha and 255 vector for 255-alpha
    //
    // NOTE: storing static __m128i here with _mm_set_si128 was _very_ slow, compiler doesn't seem
    // to know it …
```
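For reference, the scalar operation being vectorized is the premultiplied "over" operator, out = over + under * (255 - over_alpha) / 255, applied to all four channels, alpha included. A per-channel sketch (mine, for illustration):

```c
#include <stdint.h>

// Assumes valid premultiplied input (each color channel <= alpha),
// so the sum below cannot exceed 255.
static inline uint8_t blend_channel(uint8_t under, uint8_t over, uint8_t over_alpha)
{
    unsigned t = under * (255u - over_alpha) + 128u;
    return (uint8_t)(over + ((t + (t >> 8)) >> 8)); // exact rounded divide by 255
}
```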