intrinsics | 易学教程

When should I use _mm_sfence _mm_lfence and _mm_mfence

Question: I read the "Intel Optimization Guide for Intel Architecture". However, I still have no idea when I should use _mm_sfence(), _mm_lfence(), and _mm_mfence(). Could anyone explain when these should be used when writing multi-threaded code?

Answer 1: Caveat: I'm no expert in this. I'm still trying to learn this myself. But since no one has replied in the past two days, it seems experts on memory fence instructions are not plentiful. So here's my understanding ... Intel is a weakly-ordered memory

How to add an AVX2 vector horizontally 3 by 3?

Question: I have a __m256i vector containing 16 16-bit elements. I want to apply a three-adjacent horizontal addition to it. In scalar mode I use the following code:

    unsigned short int temp[16];
    __m256i sum_v; // has some values. 16 elements of 16-bit vector. | 0 | x15 | x14 | x13 | ... | x3 | x2 | x1 |
    _mm256_store_si256((__m256i *)&temp[0], sum_v);
    output1 = (temp[0] + temp[1] + temp[2]);
    output2 = (temp[3] + temp[4] + temp[5]);
    output3 = (temp[6] + temp[7] + temp[8]);
    output4 = (temp[9] + temp[10] +
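The scalar pattern in the question can be written as a loop, which is handy as a reference when checking a vectorized version. This is only a sketch: the helper name `sum_triples`, the `uint32_t` output type, and the choice to ignore the sixteenth element are assumptions, since the excerpt is cut off before it shows how many outputs are wanted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 16, GROUP = 3, GROUPS = LANES / GROUP }; /* GROUPS == 5 */

/* Hypothetical reference helper (not from the question):
 * out[g] = in[3g] + in[3g + 1] + in[3g + 2] for g = 0..4.
 * The last input element (index 15) is not part of any group here. */
static void sum_triples(const uint16_t in[LANES], uint32_t out[GROUPS])
{
    assert(in != NULL && out != NULL);
    for (size_t g = 0; g < GROUPS; ++g) {
        /* Widen before adding so three 16-bit values cannot overflow. */
        out[g] = (uint32_t)in[GROUP * g]
               + (uint32_t)in[GROUP * g + 1]
               + (uint32_t)in[GROUP * g + 2];
    }
}
```

To compare against an AVX2 implementation, store the vector to a 16-element array as the question does with _mm256_store_si256 and pass that array to the helper.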

How to use vindex and scale with _mm_i32gather_epi32 to gather elements? [duplicate]

Question: This question already has answers here: Load address calculation when using AVX2 gather instructions (3 answers). Closed last year.

Intel's Intrinsics Guide says:

    __m128i _mm_i32gather_epi32 (int const* base_addr, __m128i vindex, const int scale)

And:

Description: Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are
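The address calculation in that description can be modeled in plain C, which makes the role of scale easier to see: each index is multiplied by scale to give a byte offset from base_addr, so scale = 4 treats the indices as int element numbers. The function name `gather_epi32_scalar` is hypothetical, and this sketch assumes a 4-byte int; it models only the addressing, not the instruction's performance or fault behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a 4-lane 32-bit gather:
 * out[i] = 32-bit value at (char *)base_addr + vindex[i] * scale. */
static void gather_epi32_scalar(const int *base_addr, const int32_t vindex[4],
                                int scale, int32_t out[4])
{
    /* The intrinsic only accepts these scale factors. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    const unsigned char *bytes = (const unsigned char *)base_addr;
    for (int i = 0; i < 4; ++i) {
        /* The offset is in bytes, not in elements. */
        ptrdiff_t offset = (ptrdiff_t)vindex[i] * scale;
        /* memcpy avoids unaligned-access problems when scale is 1 or 2. */
        memcpy(&out[i], bytes + offset, sizeof out[i]);
    }
}
```

With scale = 4, indices {0, 2, 4, 6} pick every other int from an array. With scale = 8, indices {0, 1, 2, 3} pick the same elements, because each index step now advances two ints.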

ARM Neon: conditional store suggestion

Question: I'm trying to figure out how to generate a conditional store in ARM NEON. What I would like to do is the equivalent of this SSE instruction:

    void _mm_maskmoveu_si128(__m128i d, __m128i n, char *p);

which conditionally stores byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored. Any suggestion on how to do it with NEON intrinsics? Thank you.

This is what I did:

    int8x16_t store_mask = {0,0,0,0,0,0,0xff,0xff,0xff
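The store semantics described in the question can be pinned down with a scalar reference, which is useful for testing whatever NEON sequence ends up being used. This is a sketch under assumptions: `maskmove_bytes` is a hypothetical name, and it models only which bytes get written, not the cache hint of the SSE instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: byte i of d is written to p[i]
 * only when the high bit of selector byte n[i] is set.
 * Bytes whose selector high bit is clear are left untouched in memory. */
static void maskmove_bytes(const uint8_t d[16], const uint8_t n[16], uint8_t *p)
{
    assert(d != NULL && n != NULL && p != NULL);
    for (size_t i = 0; i < 16; ++i) {
        if (n[i] & 0x80u) {
            p[i] = d[i];
        }
    }
}
```

Note that a selector byte such as 0x7f does not trigger a store, because only the top bit is examined.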

Compare the sign bit in SSE Intrinsics

Question: How would one create a mask using SSE intrinsics which indicates whether the signs of two packed floats (__m128's) are the same? For example, when comparing a and b, where a is [1.0 -1.0 0.0 2.0] and b is [1.0 1.0 1.0 1.0], the desired mask would be [true false true true].

Answer 1: Here's one solution:

    const __m128i MASK = _mm_set1_epi32(0xffffffff);
    __m128 a = _mm_setr_ps(1,-1,0,2);
    __m128 b = _mm_setr_ps(1,1,1,1);
    __m128 f = _mm_xor_ps(a,b);
    __m128i i = _mm_castps_si128(f);
    i = _mm_srai_epi32
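The answer starts by XORing the two inputs, which leaves the top bit of each lane set exactly when the signs differ. A scalar model of that idea for one lane might look like the following; `same_sign_mask` is a hypothetical helper, and the sketch assumes 32-bit IEEE 754 floats. It is a model of the idea, not a reconstruction of the truncated answer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model for one lane: returns all ones when a and b
 * have the same sign bit, and zero when the sign bits differ.
 * Note that -0.0f counts as negative because only the sign bit is tested. */
static uint32_t same_sign_mask(float a, float b)
{
    uint32_t ua, ub;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&ua, &a, sizeof ua); /* reinterpret the bits without aliasing issues */
    memcpy(&ub, &b, sizeof ub);
    /* After XOR, bit 31 is set only if the sign bits differ. */
    return ((ua ^ ub) & 0x80000000u) ? 0u : 0xFFFFFFFFu;
}
```

Applied lane by lane to the example in the question, this yields [true false true true], since 0.0 has a clear sign bit just like 1.0.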

What is the availability of 'vector long long'?

Question: I'm testing on an old PowerMac G5, which is a Power4 machine. The build is failing:

    $ make
    ...
    g++ -DNDEBUG -g2 -O3 -mcpu=power4 -maltivec -c ppc-simd.cpp
    ppc-crypto.h:36: error: use of 'long long' in AltiVec types is invalid
    make: *** [ppc-simd.o] Error 1

The failure is due to:

    typedef __vector unsigned long long uint64x2_p8;

I'm having trouble determining when I should make the typedef available. With -mcpu=power4 -maltivec the machine reports 64-bit availability:

    $ gcc -mcpu=power4

_MM_TRANSPOSE4_PS causes compiler errors in GCC?

Question: I'm compiling my math library in GCC instead of MSVC for the first time and going through all the little errors, and I've hit one that simply makes no sense:

    Line 284: error: lvalue required as left operand of assignment

What's on line 284? This:

    _MM_TRANSPOSE4_PS(r, u, t, _mm_setr_ps(0.0f, 0.0f, 0.0f, 1.0f));

(r, u, and t are all instances of __m128.) Those familiar with xmmintrin.h will be aware that _MM_TRANSPOSE4_PS isn't actually a function, but rather a macro, which expands to: /*

C: x86 Intel Intrinsics usage of _mm_log2_ps() -> error: incompatible type 'int'?

Question: I'm trying to apply log2 to a __m128 variable, like this:

    #include <immintrin.h>
    int main (void) {
        __m128 two_v = {2.0, 2.0, 2.0, 2.0};
        __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2)
        return 0;
    }

Trying to compile this returns this error:

    error: initializing '__m128' with an expression of incompatible type 'int'
    __m128 log2_v = _mm_log2_ps(two_v); // log_2 := log(2)
    ^ ~~~~~~~~~~~~~~~~~~

How can I fix it?

Answer 1: The immintrin.h you are looking at and the immintrin.h used for compilation are

Why the speedup is lower than expected by using AVX2?

Question: I have vectorized the inner loop of matrix addition using AVX2 intrinsics, and I also have the latency table from here. I expect the speedup to be a factor of 5, because almost 4 latency happens in 1024 iterations over 6 latency in 128 iterations, but the speedup is a factor of 3. So the question is: what else is going on here that I don't see? I'm using GCC, coding in C with intrinsics, and the CPU is a Skylake 6700HQ. Here are the C and assembly output of the inner loop.

    global data: int __attribute

Does anybody know how to use Neon intrinsics uint8x8_t vclt_s8 (int8x8_t, int8x8_t)

Question: I want to compare two int8x8_t values. From http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html we can get the description of vclt_s8, but it does not give much detail:

    uint8x8_t vclt_s8 (int8x8_t, int8x8_t)
    Form of expected instruction(s): vcgt.s8 d0, d0, d0

The return value is a uint8x8_t, which confuses me, because I cannot use if(vclt_s8(a, b)) to decide whether the first is smaller. Suppose we have two int8x8_t values, int8x8_t a and int8x8_t b: how do we know whether a is smaller?

Answer 1: You may find
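The uint8x8_t return type makes more sense once the comparison is viewed lane by lane: each of the eight result bytes is all ones where the corresponding lane of a is less than that of b, and zero otherwise, so there is no single true or false for the whole vector. The scalar model below illustrates this; the function names are hypothetical, and it is a sketch of the semantics rather than the answer's own code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a lane-wise signed "less than":
 * out[i] = 0xFF if a[i] < b[i], else 0x00. */
static void vclt_s8_scalar(const int8_t a[8], const int8_t b[8], uint8_t out[8])
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < 8; ++i) {
        out[i] = (a[i] < b[i]) ? 0xFFu : 0x00u;
    }
}

/* Reduce the mask to one answer: is every lane of a smaller? */
static int all_lanes_set(const uint8_t m[8])
{
    uint8_t acc = 0xFFu;
    for (size_t i = 0; i < 8; ++i) acc &= m[i];
    return acc == 0xFFu;
}

/* Reduce the mask to one answer: is at least one lane of a smaller? */
static int any_lane_set(const uint8_t m[8])
{
    uint8_t acc = 0;
    for (size_t i = 0; i < 8; ++i) acc |= m[i];
    return acc != 0;
}
```

Whether "a is smaller" should mean all lanes or any lane depends on the application, which is why the mask has to be reduced explicitly before it can drive an if statement.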