intrinsics

When should I use _mm_sfence _mm_lfence and _mm_mfence

家住魔仙堡 posted on 2019-12-17 06:10:13
Question: I read the "Intel Optimization Guide For Intel Architecture". However, I still have no idea when I should use _mm_sfence(), _mm_lfence(), and _mm_mfence(). Could anyone explain when these should be used when writing multi-threaded code? Answer 1: Caveat: I'm no expert in this. I'm still trying to learn this myself. But since no one has replied in the past two days, it seems experts on memory fence instructions are not plentiful. So here's my understanding ... Intel is a weakly-ordered memory

How to add an AVX2 vector horizontally 3 by 3?

依然范特西╮ posted on 2019-12-14 04:25:16
Question: I have a __m256i vector containing 16 16-bit elements. I want to add each group of three adjacent elements horizontally. In scalar mode I use the following code: unsigned short int temp[16]; __m256i sum_v;//has some values. 16 elements of 16-bit vector. | 0 | x15 | x14 | x13 | ... | x3 | x2 | x1 | _mm256_store_si256((__m256i *)&temp[0], sum_v); output1 = (temp[0] + temp[1] + temp[2]); output2 = (temp[3] + temp[4] + temp[5]); output3 = (temp[6] + temp[7] + temp[8]); output4 = (temp[9] + temp[10] +
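The scalar pattern in this question can be written as a loop, which is handy as a reference when checking any vectorized version. The following is only a sketch: the function name `sum_adjacent_triples` is hypothetical, it assumes five full groups of three with the 16th lane left out, and it widens the sums to 32 bits so that they cannot wrap (the original outputs may well be 16-bit).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 16, GROUP = 3, GROUPS = LANES / GROUP }; /* 5 full groups */

/* Scalar reference: out[g] = in[3g] + in[3g+1] + in[3g+2].
   Assumption: in[15] belongs to no group and is ignored. */
void sum_adjacent_triples(const uint16_t in[LANES], uint32_t out[GROUPS]) {
    assert(in != NULL && out != NULL);
    for (size_t g = 0; g < GROUPS; ++g) {
        const size_t base = g * GROUP;
        /* Widen before adding so three 16-bit values cannot overflow. */
        out[g] = (uint32_t)in[base] + in[base + 1] + in[base + 2];
    }
}
```

The input array would be filled from the vector with a store such as the `_mm256_store_si256` call shown in the question.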

How to use vindex and scale with _mm_i32gather_epi32 to gather elements? [duplicate]

孤者浪人 posted on 2019-12-14 03:29:51
Question: This question already has answers here: Load address calculation when using AVX2 gather instructions (3 answers) Closed last year. Intel's Intrinsics Guide says: __m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale) And: Description Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are
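The address calculation in that description can be modeled in scalar C: each lane reads 4 bytes from `base_addr` plus its index multiplied by `scale`, where the product is a byte offset. The helper below is a hypothetical model, not an Intel API. It uses plain arrays in place of `__m128i` so that it runs without AVX2, and it assumes `scale` is one of 1, 2, 4, or 8.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of a 4-lane, 32-bit gather (hypothetical helper).
   dst[i] = the 32-bit value at byte address base_addr + vindex[i] * scale. */
void gather4_i32_model(const int32_t *base_addr, const int32_t vindex[4],
                       int scale, int32_t dst[4]) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char *bytes = (const unsigned char *)base_addr;
    for (int i = 0; i < 4; ++i) {
        /* The index is sign-extended, then scaled to a byte offset. */
        const int64_t offset = (int64_t)vindex[i] * scale;
        memcpy(&dst[i], bytes + offset, sizeof dst[i]);
    }
}
```

With `scale` equal to 4, the size of a 32-bit element, the values in `vindex` behave like ordinary array indices. With `scale` equal to 8, index 1 lands on element 2, index 2 on element 4, and so on.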

ARM Neon: conditional store suggestion

痞子三分冷 posted on 2019-12-13 18:12:23
Question: I'm trying to figure out how to generate a conditional store in ARM NEON. What I would like to do is the equivalent of this SSE instruction: void _mm_maskmoveu_si128(__m128i d, __m128i n, char *p); which conditionally stores byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored. Any suggestion on how to do it with NEON intrinsics? Thank you. This is what I did: int8x16_t store_mask = {0,0,0,0,0,0,0xff,0xff,0xff

Compare the sign bit in SSE Intrinsics

杀马特。学长 韩版系。学妹 posted on 2019-12-13 16:52:43
Question: How would one create a mask using SSE intrinsics which indicates whether the signs of two packed floats (__m128's) are the same? For example, when comparing a and b, where a is [1.0 -1.0 0.0 2.0] and b is [1.0 1.0 1.0 1.0], the desired mask is [true false true true]. Answer 1: Here's one solution: const __m128i MASK = _mm_set1_epi32(0xffffffff); __m128 a = _mm_setr_ps(1,-1,0,2); __m128 b = _mm_setr_ps(1,1,1,1); __m128 f = _mm_xor_ps(a,b); __m128i i = _mm_castps_si128(f); i = _mm_srai_epi32
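The answer above is cut off, so the following is a sketch of the same general idea rather than a reconstruction of its code: XOR the two vectors so that the top bit of each lane is set exactly where the signs differ, arithmetic-shift that bit across the lane, then invert. The function name `same_sign_mask4` and the array-based interface are assumptions made for illustration. When SSE2 is not available, the block falls back to an equivalent scalar loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* out[i] = 0xFFFFFFFF if a[i] and b[i] have the same sign bit, else 0. */
void same_sign_mask4(const float a[4], const float b[4], uint32_t out[4]) {
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__SSE2__)
    const __m128 va = _mm_loadu_ps(a);
    const __m128 vb = _mm_loadu_ps(b);
    /* Top bit of each lane is set exactly where the sign bits differ. */
    const __m128i x = _mm_castps_si128(_mm_xor_ps(va, vb));
    /* Arithmetic shift copies that bit into every bit of the lane. */
    const __m128i differ = _mm_srai_epi32(x, 31);
    /* Invert: all-ones now means "same sign". */
    const __m128i same = _mm_xor_si128(differ, _mm_set1_epi32(-1));
    _mm_storeu_si128((__m128i *)out, same);
#else
    for (int i = 0; i < 4; ++i) {
        uint32_t ua, ub;
        memcpy(&ua, &a[i], sizeof ua); /* read the raw IEEE-754 bits */
        memcpy(&ub, &b[i], sizeof ub);
        out[i] = ((ua ^ ub) >> 31) ? 0u : 0xFFFFFFFFu;
    }
#endif
}
```

Only the sign bit is compared, so +0.0 counts as positive and -0.0 as negative. For the inputs in the question, the 0.0 lane therefore compares as "same sign" against 1.0.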

What is the availability of 'vector long long'?

China☆狼群 posted on 2019-12-13 16:50:50
Question: I'm testing on an old PowerMac G5, which is a Power4 machine. The build is failing: $ make ... g++ -DNDEBUG -g2 -O3 -mcpu=power4 -maltivec -c ppc-simd.cpp ppc-crypto.h:36: error: use of 'long long' in AltiVec types is invalid make: *** [ppc-simd.o] Error 1 The failure is due to: typedef __vector unsigned long long uint64x2_p8; I'm having trouble determining when I should make the typedef available. With -mcpu=power4 -maltivec the machine reports 64-bit availability: $ gcc -mcpu=power4

_MM_TRANSPOSE4_PS causes compiler errors in GCC?

匆匆过客 posted on 2019-12-13 14:30:38
Question: I'm compiling my math library in GCC instead of MSVC for the first time and going through all the little errors, and I've hit one that simply makes no sense: Line 284: error: lvalue required as left operand of assignment What's on line 284? This: _MM_TRANSPOSE4_PS(r, u, t, _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f)); (r, u, and t are all instances of __m128) Those familiar with using xmmintrin.h will be aware that _MM_TRANSPOSE4_PS isn't actually a function, but rather a macro, which expands to: /*

C: x86 Intel Intrinsics usage of _mm_log2_ps() -> error: incompatible type 'int'?

给你一囗甜甜゛ posted on 2019-12-13 13:25:37
Question: I'm trying to apply log2 to a __m128 variable, like this: #include <immintrin.h> int main (void) { __m128 two_v = {2.0, 2.0, 2.0, 2.0}; __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2) return 0; } Trying to compile this returns this error: error: initializing '__m128' with an expression of incompatible type 'int' __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2) ^ ~~~~~~~~~~~~~~~~~~ How can I fix it? Answer 1: The immintrin.h you look into and immintrin.h used for compilation are

Why the speedup is lower than expected by using AVX2?

偶尔善良 posted on 2019-12-13 12:50:36
Question: I have vectorized the inner loop of matrix addition using AVX2 intrinsics, and I also have the latency table from here. I expect the speedup to be a factor of 5, because almost 4 latency happens in 1024 iterations over 6 latency in 128 iterations, but the speedup is a factor of 3. So the question is: what else is here that I don't see? I'm using gcc, coding in C with intrinsics, and the CPU is a Skylake 6700HQ. Here is the C and assembly output of the inner loop. global data: int __attribute

Does anybody know how to use Neon intrinsics uint8x8_t vclt_s8 (int8x8_t, int8x8_t)

吃可爱长大的小学妹 posted on 2019-12-13 04:24:50
Question: I want to compare two int8x8_t values. From http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html we can get the description for vclt_s8, but it does not give much detail. `uint8x8_t vclt_s8 (int8x8_t, int8x8_t)` Form of expected instruction(s): vcgt.s8 d0, d0, d0 The return value is a uint8x8_t, which confuses me, because I cannot use if(vclt_s8(a, b)) to decide whether the first is smaller. Suppose we have two int8x8_t values, int8x8_t a and int8x8_t b: how do we know whether a is smaller? Answer 1: You may find
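The answer is cut off here, so the following is only a scalar model of what the comparison returns, not the answerer's code. `vclt_s8` compares lane by lane and produces a mask vector in which a lane is 0xFF where `a[i] < b[i]` and 0x00 otherwise. "Is a smaller than b" therefore has no single yes/no answer until you decide whether all lanes or any lane must satisfy it. The helper names below are hypothetical, and plain arrays stand in for the NEON vector types so the sketch runs on any machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a lane-wise "less than" on 8 signed bytes:
   mask[i] = 0xFF where a[i] < b[i], otherwise 0x00. */
void vclt_s8_model(const int8_t a[8], const int8_t b[8], uint8_t mask[8]) {
    assert(a != NULL && b != NULL && mask != NULL);
    for (size_t i = 0; i < 8; ++i)
        mask[i] = (a[i] < b[i]) ? 0xFF : 0x00;
}

/* True if every lane of the mask is set. */
bool all_lanes_set(const uint8_t mask[8]) {
    for (size_t i = 0; i < 8; ++i)
        if (mask[i] != 0xFF) return false;
    return true;
}

/* True if at least one lane of the mask is set. */
bool any_lane_set(const uint8_t mask[8]) {
    for (size_t i = 0; i < 8; ++i)
        if (mask[i] != 0x00) return true;
    return false;
}
```

With real NEON code the mask is usually kept in vector form and used for a lane-wise select. It is reduced to a scalar only when a branch is genuinely needed.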