intrinsics

Is mask adaptive in __shfl_up_sync call?

Submitted by 萝らか妹 on 2019-12-11 16:55:53
Question: Basically, this is a concrete version of this post. Suppose a warp needs to process 4 objects (say, pixels in an image), with every 8 lanes grouped together to process one object. Now I need to do internal shuffle operations while processing one object (i.e. among the 8 lanes of that object). Setting the mask to 0xff worked for every object: uint32_t mask = 0xff; __shfl_up_sync(mask, val, 1); However, to my understanding, setting the mask to 0xff should force only lane0:lane7 of object0(or object3? also stuck on
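
For reference, a minimal device-code sketch of the segmented pattern being described (not from the post itself): __shfl_up_sync takes a width argument that partitions the warp into independent segments, so each 8-lane group shuffles only among its own lanes, while the mask names the warp lanes that reach the call together.

    __device__ float shift_within_object(float val)
    {
        // width = 8 splits the 32-lane warp into four independent 8-lane
        // segments, one per object; the full-warp mask says all 32 lanes
        // arrive at this call together.
        return __shfl_up_sync(0xffffffffu, val, 1, 8);
    }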

OpenMP atomic _mm_add_pd

Submitted by 瘦欲@ on 2019-12-11 13:14:33
Question: I'm trying to use OpenMP to parallelize code that is already vectorized with intrinsics, but the problem is that I'm using one XMM register as an outside 'variable' that I add to on each iteration. For now I'm using the shared clause: __m128d xmm0 = _mm_setzero_pd(); __declspec(align(16)) double res[2]; #pragma omp parallel for shared(xmm0) for (int i = 0; i < len; i++) { __m128d xmm7 = ... result of some operations; xmm0 = _mm_add_pd(xmm0, xmm7); } _mm_store_pd(res, xmm0); double final_result =
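
A common fix, sketched below under assumed names (data and len are illustrative): give each thread a private __m128d accumulator and merge the partial sums once per thread, e.g. in a critical section, instead of sharing xmm0 across the loop.

    #include <emmintrin.h>

    void sum_pairs(const double *data, int len, double res[2])
    {
        __m128d sum = _mm_setzero_pd();
        #pragma omp parallel
        {
            __m128d local = _mm_setzero_pd();        /* per-thread accumulator */
            #pragma omp for nowait
            for (int i = 0; i < len; i++)
                local = _mm_add_pd(local, _mm_loadu_pd(&data[2 * i]));
            #pragma omp critical                     /* one merge per thread */
            sum = _mm_add_pd(sum, local);
        }
        _mm_storeu_pd(res, sum);
    }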

Is it possible to rotate a 128-bit value in Altivec?

Submitted by 烈酒焚心 on 2019-12-11 12:45:00
Question: I'm trying to port some ARM NEON code to AltiVec. Our NEON code has two LOADs, one ROT, one XOR and a STORE, so it seems like a simple test case. According to IBM's vec_rl documentation: Each element of the result is obtained by rotating the corresponding element of a left by the number of bits specified by the corresponding element of b. The docs go on to say vector unsigned int is the largest data type unless -qarch=power8, in which case vector unsigned long long applies. I'd like to
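
Note that vec_rl rotates each element independently, so by itself it cannot rotate the full 128-bit register. For rotate amounts that are a multiple of 8 bits, one possibility (a sketch, assuming big-endian element order) is vec_sld, which concatenates its two operands and extracts 16 bytes; passing the same vector twice yields a whole-register byte rotate:

    #include <altivec.h>

    /* Rotate the full 128-bit value left by 32 bits (4 bytes).
       The third operand of vec_sld must be a compile-time constant 0..15. */
    vector unsigned int rotl128_32(vector unsigned int x)
    {
        return vec_sld(x, x, 4);
    }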

Is it safe to compile one source with SSE2 and another with AVX architecture?

Submitted by ≡放荡痞女 on 2019-12-11 12:28:37
Question: I'm using AVX intrinsics, but since MSVC generates non-VEX instructions for everything other than _mm256-based intrinsics, I need to compile the whole source file with /arch:AVX. The rest of the project is compiled with /arch:SSE2 so that it works on older CPUs, and I'm manually checking whether AVX is available. The source containing AVX code (compiled for AVX) includes a huge library of templates and other stuff, just to have the definitions. Is there a possibility that the compiler/linker
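
The usual way to keep this pattern safe, sketched below with illustrative names (detect_avx, process_*), is to expose the AVX translation unit only through ordinary extern functions and pick an implementation at runtime, so no AVX-compiled inline code can leak into the SSE2 objects:

    // dispatch.cpp -- built with /arch:SSE2
    #include <cstddef>

    bool detect_avx();                           // e.g. via __cpuid + _xgetbv
    void process_avx(float *p, std::size_t n);   // defined in a TU built with /arch:AVX
    void process_sse2(float *p, std::size_t n);  // defined in a TU built with /arch:SSE2

    void process(float *p, std::size_t n)
    {
        static const bool have_avx = detect_avx();  // checked once
        (have_avx ? process_avx : process_sse2)(p, n);
    }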

Testing NEON SIMD registers for equality over all lanes

Submitted by 巧了我就是萌 on 2019-12-11 04:34:54
Question: I'm using NEON intrinsics with clang. I want to test two uint32x4_t SIMD values for equality over all lanes: not 4 test results, but one single result that tells me whether A and B are equal in every lane. On Intel AVX, I would use something like: _mm256_testz_si256( _mm256_xor_si256( A, B ), _mm256_set1_epi64x( -1 ) ) What would be a good way to perform an all-lane equality test for NEON SIMD? I am assuming I will need intrinsics that operate across lanes. Does ARM NEON have those features?
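
One approach on AArch64 (a sketch; vminvq_u32 is an A64-only across-lanes minimum): compare lanewise, then reduce. vceqq_u32 sets a lane to all-ones where the inputs match, so the minimum across lanes is all-ones exactly when every lane matched. On 32-bit ARM, two pairwise vpmin steps can stand in for the reduction.

    #include <arm_neon.h>

    static inline int all_lanes_equal(uint32x4_t a, uint32x4_t b)
    {
        uint32x4_t eq = vceqq_u32(a, b);       /* 0xFFFFFFFF per matching lane */
        return vminvq_u32(eq) == 0xFFFFFFFFu;  /* AArch64 across-lanes minimum */
    }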

Linker errors when using intrinsic function via function pointer

Submitted by 痴心易碎 on 2019-12-11 04:04:49
Question: The code below doesn't build with Visual Studio 2013. I get linker errors, unresolved external symbol (LNK2019), for the _mm functions. If I use the functions directly, it all links fine. Why doesn't it link, and is there a workaround? #include "emmintrin.h" #include <smmintrin.h> #include <intrin.h> __m128i (*load)(const __m128i*) = NULL; if (it::isAligned<16>(ucpSrc, iXOffset * sizeof(unsigned char)) ) load = &_mm_load_si128; else load = &_mm_lddqu_si128; Answer 1: Where some compilers such as
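
A typical workaround, sketched here (the aligned flag is illustrative): intrinsics are not guaranteed to exist as addressable functions, so wrap each one in a real function of your own and take the wrapper's address instead.

    #include <emmintrin.h>
    #include <smmintrin.h>

    static __m128i load_aligned(const __m128i *p)   { return _mm_load_si128(p); }
    static __m128i load_unaligned(const __m128i *p) { return _mm_lddqu_si128(p); }

    __m128i load_via_pointer(const __m128i *p, bool aligned)
    {
        __m128i (*load)(const __m128i *) = aligned ? &load_aligned : &load_unaligned;
        return load(p);
    }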

What is the difference between _m_empty and _mm_empty?

Submitted by 若如初见. on 2019-12-11 03:33:22
Question: While I was looking at MMX functions, I noticed that two of them, _m_empty and _mm_empty, have exactly the same definition. So why do they both exist? Is one of them older than the other? Is there a difference that is not mentioned in the manual? Answer 1: Differences would/should be pointed out in the documentation. MSDN is more precise; it explicitly states: A synonym for _mm_empty is _m_empty. Source: https://stackoverflow.com/questions/32413644/what-is-the-difference-between-m
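
For context, a minimal usage sketch (the two spellings are interchangeable): MMX state aliases the x87 registers, so EMMS must be issued after MMX work and before subsequent floating-point code.

    #include <mmintrin.h>

    void add_bytes(__m64 a, __m64 b, __m64 *out)
    {
        *out = _mm_add_pi8(a, b);   /* some MMX work */
        _mm_empty();                /* emits EMMS; _m_empty() is a synonym */
    }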

gcc (6.1.0) using 'wrong' instructions in SSE intrinsics

Submitted by 自古美人都是妖i on 2019-12-11 03:24:53
Question: Background: I develop a computationally intensive tool, written in C/C++, that has to run on a variety of different x86_64 processors. To speed up the calculations, which are both float and integer, the code contains rather a lot of SSE* intrinsics, with different paths tailored to different CPU SSE capabilities. (As the CPU flags are detected at the start of the program and used to set Booleans, I've assumed that the branch prediction for the tailored blocks of code will work very
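
One way to keep each path compiled for exactly its own ISA with GCC, rather than raising -m flags for the whole translation unit, is per-function target attributes; a sketch with illustrative functions:

    #include <immintrin.h>

    __attribute__((target("sse2")))
    void add4_sse2(const float *a, const float *b, float *out)
    {
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }

    __attribute__((target("sse4.1")))
    float dot4_sse41(const float *a, const float *b)
    {
        /* dpps is SSE4.1; gcc emits it only inside a function explicitly
           targeted at sse4.1 or above */
        return _mm_cvtss_f32(_mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1));
    }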

Extract 4 SSE integers to 4 chars

Submitted by 巧了我就是萌 on 2019-12-11 03:05:44
Question: Suppose I have a __m128i containing 4 32-bit integer values. Is there some way I can store it inside a char[4], where the low char from each int value is stored in a char value? Desired result:

                r1          r2          r3          r4
    __m128i     0x00000012  0x00000034  0x00000056  0x00000078
                                |
                                V
    char[4]     0x12        0x34        0x56        0x78

SSE2 and below is preferred. Compiling on MSVC++. Answer 1: With SSE2 you can use the following code: char array[4]; x = _mm_packs_epi32(x, x); x = _mm_packus_epi16(x, x); *((int*)array) = _mm_cvtsi128_si32(x)
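
Expanded into a compilable form (an illustrative wrapper; the original declaration bug char[4] array is fixed to char array[4]). Note that both pack steps saturate, so values outside 0..255 should be masked first, e.g. with _mm_and_si128(x, _mm_set1_epi32(0xFF)), if plain truncation is wanted.

    #include <emmintrin.h>
    #include <string.h>

    void low_bytes(__m128i x, char out[4])
    {
        x = _mm_packs_epi32(x, x);     /* 32 -> 16 bit, signed saturation  */
        x = _mm_packus_epi16(x, x);    /* 16 -> 8 bit, unsigned saturation */
        int low = _mm_cvtsi128_si32(x);
        memcpy(out, &low, 4);          /* safer than the *(int*) cast      */
    }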

How to specify alignment with _mm_mul_ps

Submitted by 回眸只為那壹抹淺笑 on 2019-12-11 02:19:09
Question: I am using an SSE intrinsic with one of the arguments as a memory location (_mm_mul_ps(xmm1, mem)). I am unsure which will be faster: xmm1 = _mm_mul_ps(xmm0, mem) // mem is 16 byte aligned or: xmm0 = _mm_load_ps(mem); xmm1 = _mm_mul_ps(xmm1, xmm0); Is there a way to specify alignment with the _mm_mul_ps() intrinsic? Answer 1: There is no _mm_mul_ps(reg, mem) form even though the mulps reg, mem instruction form exists - https://msdn.microsoft.com/en-us/library/22kbk6t9(v=vs.90).aspx What you can do is _mm
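
In other words, a sketch of the load-then-multiply form: when the pointer is known to be 16-byte aligned, _mm_load_ps followed by _mm_mul_ps lets the compiler fold the load into a mulps xmm, [mem] memory operand, so no separate alignment parameter on _mm_mul_ps is needed.

    #include <xmmintrin.h>

    __m128 mul_from_mem(__m128 x, const float *mem)   /* mem: 16-byte aligned */
    {
        return _mm_mul_ps(x, _mm_load_ps(mem));       /* load typically folds */
    }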