sse

Compact AVX2 register so selected integers are contiguous according to mask [duplicate]

Submitted by 房东的猫 on 2019-12-23 14:46:10
Question: This question already has answers here: AVX2 what is the most efficient way to pack left based on a mask? (4 answers). Closed 3 years ago. In the question Optimizing Array Compaction, the top answer states: "SSE/AVX registers with latest instruction sets allow a better approach. We can use the result of PMOVMSKB directly, transforming it to the control register for something like PSHUFB." Is this possible with Haswell (AVX2)? Or does it require one of the flavors of AVX512? I've got an AVX2 …
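Not part of the original post, but here is a minimal sketch of the BMI2 pext + vpermd left-packing approach described in the linked duplicate (Haswell has both AVX2 and BMI2); the helper name compress_epi32 is illustrative, not from the question:

```cpp
#include <immintrin.h>
#include <cstdint>

// Left-pack the 32-bit lanes of v whose bit is set in the low 8 bits of mask.
// Requires AVX2 + BMI2 (compile with e.g. -mavx2 -mbmi2).
static inline __m256i compress_epi32(__m256i v, unsigned mask)
{
    // Spread the 8 mask bits out to one bit per byte, then widen each set
    // bit to a full 0xFF byte.
    uint64_t expanded = _pdep_u64(mask, 0x0101010101010101ULL) * 0xFF;

    // Extract the byte indices (0..7) of the selected lanes, packed into the
    // low bytes of a 64-bit integer.
    const uint64_t identity = 0x0706050403020100ULL;
    uint64_t packed_indices = _pext_u64(identity, expanded);

    // Widen the packed byte indices to 32-bit lanes and use them as a vpermd
    // control vector.
    __m256i control = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128((long long)packed_indices));
    return _mm256_permutevar8x32_epi32(v, control);
}
```

Output lanes past the first popcount(mask) entries are filled from lane 0 (the leftover control indices are zero); a caller typically stores the whole vector and advances the destination pointer by popcount(mask).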

SSE Loading & Adding

Submitted by 最后都变了- on 2019-12-23 12:52:31
Question: Assume I have two vectors represented by two arrays of type double, each of size 2. I'd like to add corresponding positions, so given vectors i0 and i1, I'd like to add i0[0] + i1[0] and i0[1] + i1[1] together. Since the type is double, I would need two registers. The trick would be to put i0[0] and i1[0] in one register, and i0[1] and i1[1] in another, and just add the register with itself. My question is: if I call _mm_load_ps(i0[0]) and then _mm_load_ps(i1[0]), will that place them in the lower and …
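Not part of the question, but a minimal sketch of the double-precision version it is reaching for: one __m128d already holds both elements of an array, so a single packed add does the whole job. The function name is illustrative, and the unaligned loads are an assumption about how i0/i1 were allocated:

```cpp
#include <emmintrin.h>  // SSE2

// out[k] = i0[k] + i1[k] for k = 0, 1.
void add2(const double i0[2], const double i1[2], double out[2])
{
    __m128d a = _mm_loadu_pd(i0);  // i0[0] in the low lane, i0[1] in the high lane
    __m128d b = _mm_loadu_pd(i1);
    _mm_storeu_pd(out, _mm_add_pd(a, b));
}
```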

For XMM/YMM FP operation on Intel Haswell, can FMA be used in place of ADD?

Submitted by 大憨熊 on 2019-12-23 11:52:47
Question: This question is about packed, single-precision floating-point ops with XMM/YMM registers on Haswell. According to the awesome, awesome table put together by Agner Fog, I know that MUL can be done on either port p0 or p1 (with reciprocal throughput of 0.5), while ADD is done only on port p1 (with reciprocal throughput of 1). I can accept this limitation, BUT I also know that FMA can be done on either port p0 or p1 (with reciprocal throughput of 0.5). So it is confusing to me why a plain ADD would be limited to only …
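Not in the question, but as an aside: an FMA with a multiplier of 1.0 computes the same result as an add and can issue on either FMA port on Haswell, at the cost of FMA's longer latency (5 cycles versus 3 for addps). A minimal sketch, assuming FMA is enabled (e.g. -mfma):

```cpp
#include <immintrin.h>

// a + b expressed as a * 1.0f + b, so it can go to port 0 or port 1.
__m256 add_via_fma(__m256 a, __m256 b)
{
    const __m256 ones = _mm256_set1_ps(1.0f);
    return _mm256_fmadd_ps(a, ones, b);
}
```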

implications of using _mm_shuffle_ps on integer vector

Submitted by 拜拜、爱过 on 2019-12-23 11:46:50
Question: SSE intrinsics include _mm_shuffle_ps(xmm1, xmm2, imm8), which allows one to pick 2 elements from xmm1 concatenated with 2 elements from xmm2. However, this is for floats (implied by the _ps, packed single). If you cast your packed integers (__m128i), then you can use _mm_shuffle_ps as well: #include <iostream> #include <immintrin.h> #include <sstream> using namespace std; template <typename T> std::string __m128i_toString(const __m128i var) { std::stringstream sstr; const T* values = …
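Not part of the question, but a small self-contained sketch of the cast-based shuffle it describes: _mm_castsi128_ps and _mm_castps_si128 generate no instructions, so only shufps is emitted and the integer bit patterns pass through unchanged; the usual concern is a possible bypass delay when mixing integer and floating-point domains on some microarchitectures.

```cpp
#include <immintrin.h>

// Shuffle 32-bit integer lanes with shufps via bit-casts.
// Result lanes (low to high): a[1], a[0], b[3], b[2].
__m128i shuffle_ints(__m128i a, __m128i b)
{
    __m128 af = _mm_castsi128_ps(a);
    __m128 bf = _mm_castsi128_ps(b);
    __m128 r  = _mm_shuffle_ps(af, bf, _MM_SHUFFLE(2, 3, 0, 1));
    return _mm_castps_si128(r);
}
```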

Work around lack of Yz machine constraint under Clang?

Submitted by 旧巷老猫 on 2019-12-23 10:10:52
Question: We use inline assembly to make SHA instructions available when __SHA__ is not defined. Under GCC we use: GCC_INLINE __m128i GCC_INLINE_ATTRIB MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c) { asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "Yz"(c)); return a; } Clang does not accept GCC's Yz constraint (see Clang 3.2 Issue 13199 and Clang 3.9 Issue 32727), which is required by the sha256rnds2 instruction: Yz means the first SSE register (%xmm0). We added a mov for Clang: asm ("mov …
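Not from the question itself, but one possible shape of the mov-based workaround it alludes to, as a sketch: load c into %xmm0 inside the asm and list xmm0 as a clobber so the compiler keeps the operands out of that register. The GCC_INLINE macros are dropped here for brevity.

```cpp
#include <immintrin.h>

// sha256rnds2 requires its third operand in %xmm0; without Clang support for
// the "Yz" constraint, move c there explicitly inside the asm.
static inline __m128i MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c)
{
    asm ("movdqa %2, %%xmm0        \n\t"
         "sha256rnds2 %%xmm0, %1, %0"
         : "+x"(a)
         : "x"(b), "x"(c)
         : "xmm0");   // clobber keeps a, b, c out of %xmm0
    return a;
}
```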

Forcing AVX intrinsics to use SSE instructions instead

Submitted by ぃ、小莉子 on 2019-12-23 09:26:32
Question: Unfortunately I have an AMD Piledriver CPU, which seems to have problems with AVX instructions: "Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5 - 6 times slower than on the previous model (Bulldozer), and 8 - 9 times slower than two 128-bit writes." In my own experience, I've found _mm256 intrinsics to be much slower than the 128-bit _mm ones, and I'm assuming it's because of the above. I really want to code for the newest instruction set, AVX, though, …
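Not part of the question, but since the quoted problem is specifically slow 256-bit stores, one commonly suggested middle ground is to keep 256-bit arithmetic and split only the store into two 128-bit halves. A minimal sketch (the function name is illustrative):

```cpp
#include <immintrin.h>

// Store a 256-bit vector as two 128-bit stores.
static inline void store256_as_two_128(float* dst, __m256 v)
{
    _mm_storeu_ps(dst,     _mm256_castps256_ps128(v));   // low half (the cast is free)
    _mm_storeu_ps(dst + 4, _mm256_extractf128_ps(v, 1)); // high half
}
```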

Segmentation fault while working with SSE intrinsics due to incorrect memory alignment

Submitted by 非 Y 不嫁゛ on 2019-12-23 09:23:41
Question: I am working with SSE intrinsics for the first time and I am encountering a segmentation fault even after ensuring 16-byte memory alignment. This post is an extension of my earlier question: How to allocate 16byte memory aligned data. This is how I have declared my array: float *V = (float*) memalign(16, dx*sizeof(float)); When I try to do this: __m128 v_i = _mm_load_ps(&V[i]); // it works. But when I do this: __m128 u1 = _mm_load_ps(&V[(i-1)]); // there is a segmentation fault. But if I do: __m128 …
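Not in the question, but assuming the root cause is that &V[i-1] is not a multiple of 16 bytes (V itself is 16-byte aligned, so &V[i] is aligned only when i is a multiple of 4), the usual fix is an unaligned load; a minimal sketch with an illustrative helper name:

```cpp
#include <xmmintrin.h>

// _mm_loadu_ps tolerates any alignment, unlike _mm_load_ps.
__m128 load_window(const float* V, int i)
{
    return _mm_loadu_ps(&V[i - 1]);
}
```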

Best way to shuffle 64-bit portions of two __m128i's

Submitted by 我们两清 on 2019-12-23 07:49:54
Question: I have two __m128i s, a and b, that I want to shuffle so that the upper 64 bits of a fall in the lower 64 bits of dst and the lower 64 bits of b fall in the upper 64 bits of dst, i.e. dst[0:63] = a[64:127], dst[64:127] = b[0:63]. Equivalent to: __m128i dst = _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b); or __m128i dst = _mm_castpd_si128(_mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1)); Is there a better way to do this than the first method? The second one is just one instruction, but the …
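Not part of the question, but for comparison, a single SSSE3 palignr also yields the requested layout (dst[0:63] = a[64:127], dst[64:127] = b[0:63]); a minimal sketch:

```cpp
#include <tmmintrin.h>  // SSSE3

// Take bytes 8..23 of the b:a concatenation: high half of a, then low half of b.
__m128i combine_halves(__m128i a, __m128i b)
{
    return _mm_alignr_epi8(b, a, 8);
}
```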

Is there still any development on SIMD in Mono?

Submitted by 丶灬走出姿态 on 2019-12-23 07:46:03
Question: I want to know whether there has been, or is, any development on Mono.SIMD (or SIMD support in general inside Mono) since it came out 5(!) years ago. I personally think this was a great step toward improving speed for C#. However, I've used it for some time now and I feel that Mono.SIMD is falling behind, as lots of functions are missing. Some of the problems I'm facing include: the lack of a dot product, which can be implemented in 1 operation ever since SSE4.1 (which came out in 2006 and is now …
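Not part of the question, but for reference, the single-instruction dot product it refers to is SSE4.1's dpps; a minimal sketch in C intrinsics (this is the operation the question says Mono.SIMD lacked at the time):

```cpp
#include <smmintrin.h>  // SSE4.1

// Dot product of two 4-float vectors: mask 0xF1 multiplies all four lane
// pairs and writes the sum to lane 0 only.
float dot4(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}
```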