intrinsics

Is mask adaptive in __shfl_up_sync call?

Submitted by 萝らか妹 on 2019-12-11 16:55:53
Question: Basically, this is a concrete version of this post. Suppose a warp needs to process 4 objects (say, pixels in an image), with every 8 lanes grouped together to process one object. Now I need to do internal shuffle operations while processing one object (i.e. among the 8 lanes of that object). Setting the mask to 0xff worked for every object: uint32_t mask = 0xff; __shfl_up_sync(mask, val, 1); However, to my understanding, setting the mask to 0xff should force only lane0:lane7 of object0(or object3? also stuck on
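
For reference, a minimal device-code sketch of the segmented pattern being described (not from the post itself): __shfl_up_sync takes a width argument that partitions the warp into independent segments, so each 8-lane group shuffles only among its own lanes, while the mask names the warp lanes that reach the call together.

    __device__ float shift_within_object(float val)
    {
        // width = 8 splits the 32-lane warp into four independent 8-lane
        // segments, one per object; the full-warp mask says all 32 lanes
        // arrive at this call together.
        return __shfl_up_sync(0xffffffffu, val, 1, 8);
    }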

OpenMP atomic _mm_add_pd

Submitted by 瘦欲@ on 2019-12-11 13:14:33
Question: I'm trying to use OpenMP to parallelize code that is already vectorized with intrinsics, but the problem is that I'm using one XMM register as an outside 'variable' that I add to on each iteration. For now I'm using the shared clause: __m128d xmm0 = _mm_setzero_pd(); __declspec(align(16)) double res[2]; #pragma omp parallel for shared(xmm0) for (int i = 0; i < len; i++) { __m128d xmm7 = ... result of some operations; xmm0 = _mm_add_pd(xmm0, xmm7); } _mm_store_pd(res, xmm0); double final_result =
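
A common fix, sketched below under assumed names (data and len are illustrative): give each thread a private __m128d accumulator and merge the partial sums once per thread, e.g. in a critical section, instead of sharing xmm0 across the loop.

    #include <emmintrin.h>

    void sum_pairs(const double *data, int len, double res[2])
    {
        __m128d sum = _mm_setzero_pd();
        #pragma omp parallel
        {
            __m128d local = _mm_setzero_pd();        /* per-thread accumulator */
            #pragma omp for nowait
            for (int i = 0; i < len; i++)
                local = _mm_add_pd(local, _mm_loadu_pd(&data[2 * i]));
            #pragma omp critical                     /* one merge per thread */
            sum = _mm_add_pd(sum, local);
        }
        _mm_storeu_pd(res, sum);
    }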

Is it possible to rotate a 128-bit value in Altivec?

Submitted by 烈酒焚心 on 2019-12-11 12:45:00
Question: I'm trying to port some ARM NEON code to AltiVec. Our NEON code has two LOADs, one ROT, one XOR and a STORE, so it seems like a simple test case. According to IBM's vec_rl documentation: Each element of the result is obtained by rotating the corresponding element of a left by the number of bits specified by the corresponding element of b. The docs go on to say vector unsigned int is the largest data type unless -qarch=power8, in which case vector unsigned long long applies. I'd like to
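
Note that vec_rl rotates each element independently, so by itself it cannot rotate the full 128-bit register. For rotate amounts that are a multiple of 8 bits, one possibility (a sketch, assuming big-endian element order) is vec_sld, which concatenates its two operands and extracts 16 bytes; passing the same vector twice yields a whole-register byte rotate:

    #include <altivec.h>

    /* Rotate the full 128-bit value left by 32 bits (4 bytes).
       The third operand of vec_sld must be a compile-time constant 0..15. */
    vector unsigned int rotl128_32(vector unsigned int x)
    {
        return vec_sld(x, x, 4);
    }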

Is it safe to compile one source with SSE2 and another with AVX architecture?

Submitted by ≡放荡痞女 on 2019-12-11 12:28:37
Question: I'm using AVX intrinsics, but since MSVC generates non-VEX instructions for everything other than _mm256-based intrinsics, I need to compile the whole source file with /arch:AVX. The rest of the project is compiled with /arch:SSE2 so that it works on older CPUs, and I'm manually checking whether AVX is available. The source containing AVX code (compiled for AVX) includes a huge library of templates and other stuff, just to have the definitions. Is there a possibility that the compiler/linker
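
The usual way to keep this pattern safe, sketched below with illustrative names (detect_avx, process_*), is to expose the AVX translation unit only through ordinary extern functions and pick an implementation at runtime, so no AVX-compiled inline code can leak into the SSE2 objects:

    // dispatch.cpp -- built with /arch:SSE2
    #include <cstddef>

    bool detect_avx();                           // e.g. via __cpuid + _xgetbv
    void process_avx(float *p, std::size_t n);   // defined in a TU built with /arch:AVX
    void process_sse2(float *p, std::size_t n);  // defined in a TU built with /arch:SSE2

    void process(float *p, std::size_t n)
    {
        static const bool have_avx = detect_avx();  // checked once
        (have_avx ? process_avx : process_sse2)(p, n);
    }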

Testing NEON SIMD registers for equality over all lanes

Submitted by 巧了我就是萌 on 2019-12-11 04:34:54
Question: I'm using NEON intrinsics with clang. I want to test two uint32x4_t SIMD values for equality over all lanes: not 4 test results, but one single result that tells me whether A and B are equal in every lane. On Intel AVX, I would use something like: _mm256_testz_si256( _mm256_xor_si256( A, B ), _mm256_set1_epi64x( -1 ) ) What would be a good way to perform an all-lane equality test for NEON SIMD? I am assuming I will need intrinsics that operate across lanes. Does ARM NEON have those features?
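
One approach on AArch64 (a sketch; vminvq_u32 is an A64-only across-lanes minimum): compare lanewise, then reduce. vceqq_u32 sets a lane to all-ones where the inputs match, so the minimum across lanes is all-ones exactly when every lane matched. On 32-bit ARM, two pairwise vpmin steps can stand in for the reduction.

    #include <arm_neon.h>

    static inline int all_lanes_equal(uint32x4_t a, uint32x4_t b)
    {
        uint32x4_t eq = vceqq_u32(a, b);       /* 0xFFFFFFFF per matching lane */
        return vminvq_u32(eq) == 0xFFFFFFFFu;  /* AArch64 across-lanes minimum */
    }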

Linker errors when using intrinsic function via function pointer

Submitted by 痴心易碎 on 2019-12-11 04:04:49
Question: The code below doesn't build with Visual Studio 2013. I get linker errors, unresolved external symbol (LNK2019), for the _mm functions. If I use the functions directly, it all links fine. Why doesn't it link, and is there a workaround? #include "emmintrin.h" #include <smmintrin.h> #include <intrin.h> __m128i (*load)(const __m128i*) = NULL; if (it::isAligned<16>(ucpSrc, iXOffset * sizeof(unsigned char)) ) load = &_mm_load_si128; else load = &_mm_lddqu_si128; Answer 1: Where some compilers such as
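
A typical workaround, sketched here (the aligned flag is illustrative): intrinsics are not guaranteed to exist as addressable functions, so wrap each one in a real function of your own and take the wrapper's address instead.

    #include <emmintrin.h>
    #include <smmintrin.h>

    static __m128i load_aligned(const __m128i *p)   { return _mm_load_si128(p); }
    static __m128i load_unaligned(const __m128i *p) { return _mm_lddqu_si128(p); }

    __m128i load_via_pointer(const __m128i *p, bool aligned)
    {
        __m128i (*load)(const __m128i *) = aligned ? &load_aligned : &load_unaligned;
        return load(p);
    }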

What is the difference between _m_empty and _mm_empty?

Submitted by 若如初见. on 2019-12-11 03:33:22
Question: While I was looking at MMX functions, I noticed that two of them, _m_empty and _mm_empty, have exactly the same definition. So why do they both exist? Is one of them older than the other? Is there a difference that is not mentioned in the manual? Answer 1: Differences would/should be pointed out in the documentation. MSDN is more precise; it explicitly states: A synonym for _mm_empty is _m_empty. Source: https://stackoverflow.com/questions/32413644/what-is-the-difference-between-m
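
For context, a minimal usage sketch (the two spellings are interchangeable): MMX state aliases the x87 registers, so EMMS must be issued after MMX work and before subsequent floating-point code.

    #include <mmintrin.h>

    void add_bytes(__m64 a, __m64 b, __m64 *out)
    {
        *out = _mm_add_pi8(a, b);   /* some MMX work */
        _mm_empty();                /* emits EMMS; _m_empty() is a synonym */
    }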

gcc (6.1.0) using 'wrong' instructions in SSE intrinsics

Submitted by 自古美人都是妖i on 2019-12-11 03:24:53
Question: Background: I develop a computationally intensive tool, written in C/C++, that has to run on a variety of different x86_64 processors. To speed up the calculations, which are both float and integer, the code contains rather a lot of SSE* intrinsics, with different paths tailored to different CPU SSE capabilities. (As the CPU flags are detected at the start of the program and used to set Booleans, I've assumed that the branch prediction for the tailored blocks of code will work very
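
One way to keep each path compiled for exactly its own ISA with GCC, rather than raising -m flags for the whole translation unit, is per-function target attributes; a sketch with illustrative functions:

    #include <immintrin.h>

    __attribute__((target("sse2")))
    void add4_sse2(const float *a, const float *b, float *out)
    {
        _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }

    __attribute__((target("sse4.1")))
    float dot4_sse41(const float *a, const float *b)
    {
        /* dpps is SSE4.1; gcc emits it only inside a function explicitly
           targeted at sse4.1 or above */
        return _mm_cvtss_f32(_mm_dp_ps(_mm_loadu_ps(a), _mm_loadu_ps(b), 0xF1));
    }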

Extract 4 SSE integers to 4 chars

Submitted by 巧了我就是萌 on 2019-12-11 03:05:44
Question: Suppose I have a __m128i containing 4 32-bit integer values. Is there some way I can store it inside a char[4], where the low char from each int value is stored in a char value? Desired result:

                r1          r2          r3          r4
    __m128i     0x00000012  0x00000034  0x00000056  0x00000078
                                |
                                V
    char[4]     0x12        0x34        0x56        0x78

SSE2 and below is preferred. Compiling on MSVC++. Answer 1: With SSE2 you can use the following code: char array[4]; x = _mm_packs_epi32(x, x); x = _mm_packus_epi16(x, x); *((int*)array) = _mm_cvtsi128_si32(x)
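
Expanded into a compilable form (an illustrative wrapper; the original declaration bug char[4] array is fixed to char array[4]). Note that both pack steps saturate, so values outside 0..255 should be masked first, e.g. with _mm_and_si128(x, _mm_set1_epi32(0xFF)), if plain truncation is wanted.

    #include <emmintrin.h>
    #include <string.h>

    void low_bytes(__m128i x, char out[4])
    {
        x = _mm_packs_epi32(x, x);     /* 32 -> 16 bit, signed saturation  */
        x = _mm_packus_epi16(x, x);    /* 16 -> 8 bit, unsigned saturation */
        int low = _mm_cvtsi128_si32(x);
        memcpy(out, &low, 4);          /* safer than the *(int*) cast      */
    }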

How to specify alignment with _mm_mul_ps

Submitted by 回眸只為那壹抹淺笑 on 2019-12-11 02:19:09
Question: I am using an SSE intrinsic with one of the arguments as a memory location (_mm_mul_ps(xmm1, mem)). I am unsure which will be faster: xmm1 = _mm_mul_ps(xmm0, mem) // mem is 16 byte aligned or: xmm0 = _mm_load_ps(mem); xmm1 = _mm_mul_ps(xmm1, xmm0); Is there a way to specify alignment with the _mm_mul_ps() intrinsic? Answer 1: There is no _mm_mul_ps(reg, mem) form even though the mulps reg, mem instruction form exists - https://msdn.microsoft.com/en-us/library/22kbk6t9(v=vs.90).aspx What you can do is _mm
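
In other words, a sketch of the load-then-multiply form: when the pointer is known to be 16-byte aligned, _mm_load_ps followed by _mm_mul_ps lets the compiler fold the load into a mulps xmm, [mem] memory operand, so no separate alignment parameter on _mm_mul_ps is needed.

    #include <xmmintrin.h>

    __m128 mul_from_mem(__m128 x, const float *mem)   /* mem: 16-byte aligned */
    {
        return _mm_mul_ps(x, _mm_load_ps(mem));       /* load typically folds */
    }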