intrinsics

Translating SSE to Neon: How to pack and then extract 32bit result

吃可爱长大的小学妹 submitted on 2019-12-01 18:03:53
Question: I have to translate the following instructions from SSE to Neon: uint32_t a = _mm_cvtsi128_si32(_mm_shuffle_epi8(a,SHUFFLE_MASK) ); where: static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3, 7, 11, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); So basically I have to take the 4th, 8th, 12th and 16th bytes from the register and put them into a uint32_t. Looks like a packing instruction (in SSE I seem to remember I used shuffle because it saves one instruction compared to packing, this example
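As a reference point for any Neon translation, the effect of the shuffle plus extract can be written as plain scalar code. This is only a sketch of what the SSE expression computes, assuming the usual little-endian byte order of the register; `pack_every_fourth_byte` is a hypothetical name, not something from the question.

```cpp
#include <cstdint>

// Scalar model of _mm_cvtsi128_si32(_mm_shuffle_epi8(v, SHUFFLE_MASK)):
// bytes 3, 7, 11 and 15 of the 16-byte register become bytes 0..3 of the
// 32-bit result (byte 3 ends up in the least significant position).
// pack_every_fourth_byte is a hypothetical helper name.
std::uint32_t pack_every_fourth_byte(const std::uint8_t (&v)[16]) {
    return  static_cast<std::uint32_t>(v[3])
         | (static_cast<std::uint32_t>(v[7])  << 8)
         | (static_cast<std::uint32_t>(v[11]) << 16)
         | (static_cast<std::uint32_t>(v[15]) << 24);
}
```

A candidate Neon sequence can be checked against this function on a handful of input registers before it replaces the SSE path.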

How cast C++ class to intrinsic type

三世轮回 submitted on 2019-12-01 18:00:41
Basic C++ class question: I have simple code currently that looks something like this: typedef int sType; int array[100]; int test(sType s) { return array[ (int)s ]; } What I want is to convert "sType" to a class, such that the "return array[ (int)s ]" line does not need to be changed. E.g. (pseudocode) class sType { public: int castInt() { return val; } int val; } int array[100]; int test(sType s) { return array[ (int)s ]; } Thanks for any help. class sType { public: operator int() const { return val; } private: int val; }; class sType { public: operator int() const { return val; } int
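The conversion-operator approach quoted above can be put together into a complete, compilable sketch. The constructor is an addition for illustration (the excerpt does not show how `val` gets initialised); everything else follows the question's names.

```cpp
// sType as a class with a user-defined conversion to int.
class sType {
public:
    explicit sType(int v) : val(v) {}      // constructor added for this sketch
    operator int() const { return val; }   // makes (int)s compile unchanged
private:
    int val;
};

int array[100];

int test(sType s) {
    return array[(int)s];   // same line as in the question
}
```

Because `operator int()` is a member conversion function, the existing cast expression keeps working without any change at the call site.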

How cast C++ class to intrinsic type

偶尔善良 submitted on 2019-12-01 16:30:16
Question: Basic C++ class question: I have simple code currently that looks something like this: typedef int sType; int array[100]; int test(sType s) { return array[ (int)s ]; } What I want is to convert "sType" to a class, such that the "return array[ (int)s ]" line does not need to be changed. E.g. (pseudocode) class sType { public: int castInt() { return val; } int val; } int array[100]; int test(sType s) { return array[ (int)s ]; } Thanks for any help. Answer 1: class sType { public: operator int(

Optimisation using SSE Intrinsics

风格不统一 submitted on 2019-12-01 12:48:45
I am trying to convert a loop I have into SSE intrinsics. I seem to have made fairly good progress, and by that I mean it's in the correct direction; however, I appear to have done some of the translation wrong somewhere, as I am not getting the same "correct" answer which results from the non-SSE code. My original loop, which I unrolled by a factor of 4, looks like this: int unroll_n = (N/4)*4; for (int j = 0; j < unroll_n; j++) { for (int i = 0; i < unroll_n; i+=4) { float rx = x[j] - x[i]; float ry = y[j] - y[i]; float rz = z[j] - z[i]; float r2 = rx*rx + ry*ry + rz*rz + eps; float r2inv = 1
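The part of the loop body that is visible in the excerpt (up to `r2`) maps onto SSE roughly as follows. This is a hedged sketch, not the asker's code: `r2_block` is a hypothetical helper, the `r2inv` step and everything after it are cut off in the excerpt and are not reconstructed here, and a scalar fallback is included so the block also builds on targets without SSE.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define R2_BLOCK_USE_SSE 1
#endif

// Computes r2 = rx*rx + ry*ry + rz*rz + eps for the four elements
// i, i+1, i+2, i+3 against a fixed j. r2_block is a hypothetical name.
void r2_block(const float* x, const float* y, const float* z,
              std::size_t j, std::size_t i, float eps, float out[4]) {
#ifdef R2_BLOCK_USE_SSE
    // x[j] is broadcast to all four lanes; x[i..i+3] is loaded unaligned.
    __m128 rx = _mm_sub_ps(_mm_set1_ps(x[j]), _mm_loadu_ps(x + i));
    __m128 ry = _mm_sub_ps(_mm_set1_ps(y[j]), _mm_loadu_ps(y + i));
    __m128 rz = _mm_sub_ps(_mm_set1_ps(z[j]), _mm_loadu_ps(z + i));
    // Same left-to-right summation order as the scalar expression.
    __m128 r2 = _mm_add_ps(_mm_mul_ps(rx, rx), _mm_mul_ps(ry, ry));
    r2 = _mm_add_ps(r2, _mm_mul_ps(rz, rz));
    r2 = _mm_add_ps(r2, _mm_set1_ps(eps));
    _mm_storeu_ps(out, r2);
#else
    for (std::size_t k = 0; k < 4; ++k) {
        float rx = x[j] - x[i + k];
        float ry = y[j] - y[i + k];
        float rz = z[j] - z[i + k];
        out[k] = rx * rx + ry * ry + rz * rz + eps;
    }
#endif
}
```

Comparing each vector stage against the scalar loop in this way, one stage at a time, is a practical way to locate where a translation starts to diverge.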

Intrinsic to count trailing zero bits in 64-bit integers?

不问归期 submitted on 2019-12-01 11:09:08
This is sort of a follow-up on some previous questions on bit manipulation. I modified the code from this site to enumerate strings with K of N bits set (x is the current int64_t with K bits set, and at the end of this code it is the lexicographically next integer with K bits set): int64_t b, t, c, m, r,z; b = x & -x; t = x + b; c = x^t; // was m = (c >> 2)/b per link z = __builtin_ctz(x); m = c >> 2+z; x = t|m; The modification using __builtin_ctz() works fine as long as the least significant one bit is in the lower DWORD of x, but if it is not, it totally breaks. This can be seen with the
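`__builtin_ctz` takes an `unsigned int`, so a 64-bit argument is truncated to its low 32 bits, which matches the failure described. A minimal sketch of the same update using the 64-bit GCC/Clang builtin `__builtin_ctzll` is shown below; `next_same_popcount` is a hypothetical name, unsigned arithmetic is used instead of `int64_t`, and the shift is split in two so that it never reaches 64.

```cpp
#include <cstdint>

// Returns the next larger integer with the same number of set bits.
// Preconditions (assumed, not checked): x != 0 and a next value exists
// within 64 bits. next_same_popcount is a hypothetical name.
std::uint64_t next_same_popcount(std::uint64_t x) {
    std::uint64_t b = x & (~x + 1);     // lowest set bit, same as x & -x
    std::uint64_t t = x + b;
    std::uint64_t c = x ^ t;
    int z = __builtin_ctzll(x);         // 64-bit count of trailing zeros
    std::uint64_t m = (c >> 2) >> z;    // equals (c >> 2) / b
    return t | m;
}
```

For example, starting from `3ULL << 32` the function yields `(1ULL << 34) | 1`, the case where the lowest set bit lies in the upper DWORD.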

Optimisation using SSE Intrinsics

血红的双手。 submitted on 2019-12-01 10:52:26
Question: I am trying to convert a loop I have into SSE intrinsics. I seem to have made fairly good progress, and by that I mean it's in the correct direction; however, I appear to have done some of the translation wrong somewhere, as I am not getting the same "correct" answer which results from the non-SSE code. My original loop, which I unrolled by a factor of 4, looks like this: int unroll_n = (N/4)*4; for (int j = 0; j < unroll_n; j++) { for (int i = 0; i < unroll_n; i+=4) { float rx = x[j] - x[i];

How to swap two __m128i variables in C++03 given its an opaque type and an array?

99封情书 submitted on 2019-12-01 08:35:34
What is the best practice for swapping __m128i variables? The background is a compile error under Sun Studio 12.2, which is a C++03 compiler. __m128i is an opaque type used with MMX and SSE instructions, and it's usually an unsigned long long[2]. C++03 does not provide support for swapping arrays, and std::swap(__m128i a, __m128i b) fails under the compiler. Here are some related questions that don't quite hit the mark. They don't apply because std::vector is not available. How can we swap 2 arrays in constant complexity or O(1)? Is it possible to swap arrays of structs C++03 moving a

A faster integer SSE unaligned load that's rarely used [duplicate]

霸气de小男生 submitted on 2019-12-01 06:35:07
This question already has an answer here: what's the difference between _mm256_lddqu_si256 and _mm256_loadu_si256 1 answer I would like to know more about the _mm_lddqu_si128 intrinsic (lddqu instruction since SSE3), particularly compared with the _mm_loadu_si128 intrinsic (movdqu instruction since SSE2). I only discovered _mm_lddqu_si128 today. The Intel Intrinsics Guide says this intrinsic may perform better than _mm_loadu_si128 when the data crosses a cache line boundary, and a comment says it will perform better under certain circumstances, but never perform worse. So why is it not used

set individual bit in AVX register (__m256i), need “random access” operator

亡梦爱人 submitted on 2019-12-01 06:21:57
Question: So, I want to set an individual bit of a __m256i register. Say my __m256i contains: [ 1 0 1 0 | 1 0 1 0 | ... | 1 0 1 0 ], how do I set and unset the n-th bit? Answer 1: This is an implementation of a function which can set an individual bit inside a vector: #include <immintrin.h> #include <assert.h> void SetBit(__m256i & vector, size_t position, bool value) { assert(position <= 255); uint8_t lut[32] = { 0 }; lut[position >> 3] = 1 << (position & 7); __m256i mask = _mm256_loadu_si256((__m256i*)lut);
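The quoted answer is cut off after the mask is loaded. The same mask idea can be illustrated on a plain 32-byte buffer, which runs without AVX support. This is a sketch under assumptions: `SetBitScalar` is a hypothetical name, and the OR / AND-NOT step is presumably what the vector version does with `_mm256_or_si256` and `_mm256_andnot_si256`, but the excerpt does not show it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 256 bits modeled as 32 bytes; bit `position` lives in byte position / 8.
// SetBitScalar is a hypothetical scalar stand-in for the AVX version.
void SetBitScalar(std::uint8_t (&bits)[32], std::size_t position, bool value) {
    assert(position <= 255);
    std::uint8_t lut[32] = { 0 };
    lut[position >> 3] = static_cast<std::uint8_t>(1u << (position & 7));
    for (std::size_t k = 0; k < 32; ++k) {
        if (value) {
            bits[k] = static_cast<std::uint8_t>(bits[k] | lut[k]);   // vector analogue: OR with mask
        } else {
            bits[k] = static_cast<std::uint8_t>(bits[k] & ~lut[k]);  // vector analogue: AND-NOT with mask
        }
    }
}
```

Only one byte of the mask is non-zero, so every other byte of the buffer passes through unchanged in both branches.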

Issue with __m256 type of intel intrinsics

▼魔方 西西 submitted on 2019-12-01 05:43:24
I'm trying to test some of the Intel intrinsics to see how they work. So, I created a function to do that for me, and this is the code: void test_intel_256() { __m256 res,vec1,vec2; __M256_MM_SET_PS(vec1, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0); __M256_MM_SET_PS(vec1, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0); __M256_MM_ADD_PS(res,vec1,vec2); if (res[0] ==9 && res[1] ==9 && res[2] ==9 && res[3] ==9 && res[4] ==9 && res[5] ==9 && res[6] ==9 && res[7] ==9 ) printf("Addition : OK!\n"); else printf("Addition : FAILED!\n"); } But then I'm getting these errors: error: unknown type name ‘__m256’ error: