sse

Compact AVX2 register so selected integers are contiguous according to mask [duplicate]

Submitted by 房东的猫 on 2019-12-23 14:46:10
Question: This question already has answers here: AVX2 what is the most efficient way to pack left based on a mask? (4 answers). Closed 3 years ago. In the question Optimizing Array Compaction, the top answer states: "SSE/AVX registers with latest instruction sets allow a better approach. We can use the result of PMOVMSKB directly, transforming it to the control register for something like PSHUFB." Is this possible with Haswell (AVX2)? Or does it require one of the flavors of AVX512? I've got an AVX2 …
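Not part of the original post, but here is a minimal sketch of the BMI2 pext + vpermd left-packing approach described in the linked duplicate (Haswell has both AVX2 and BMI2); the helper name compress_epi32 is illustrative, not from the question:

```cpp
#include <immintrin.h>
#include <cstdint>

// Left-pack the 32-bit lanes of v whose bit is set in the low 8 bits of mask.
// Requires AVX2 + BMI2 (compile with e.g. -mavx2 -mbmi2).
static inline __m256i compress_epi32(__m256i v, unsigned mask)
{
    // Spread the 8 mask bits out to one bit per byte, then widen each set
    // bit to a full 0xFF byte.
    uint64_t expanded = _pdep_u64(mask, 0x0101010101010101ULL) * 0xFF;

    // Extract the byte indices (0..7) of the selected lanes, packed into the
    // low bytes of a 64-bit integer.
    const uint64_t identity = 0x0706050403020100ULL;
    uint64_t packed_indices = _pext_u64(identity, expanded);

    // Widen the packed byte indices to 32-bit lanes and use them as a vpermd
    // control vector.
    __m256i control = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128((long long)packed_indices));
    return _mm256_permutevar8x32_epi32(v, control);
}
```

Output lanes past the first popcount(mask) entries are filled from lane 0 (the leftover control indices are zero); a caller typically stores the whole vector and advances the destination pointer by popcount(mask).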

SSE Loading & Adding

Submitted by 最后都变了- on 2019-12-23 12:52:31
Question: Assume I have two vectors represented by two arrays of type double, each of size 2. I'd like to add corresponding positions, so given vectors i0 and i1, I'd like to add i0[0] + i1[0] and i0[1] + i1[1] together. Since the type is double, I would need two registers. The trick would be to put i0[0] and i1[0] in one register, and i0[1] and i1[1] in another, and just add the register with itself. My question is: if I call _mm_load_ps(i0[0]) and then _mm_load_ps(i1[0]), will that place them in the lower and …
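Not part of the question, but a minimal sketch of the double-precision version it is reaching for: one __m128d already holds both elements of an array, so a single packed add does the whole job. The function name is illustrative, and the unaligned loads are an assumption about how i0/i1 were allocated:

```cpp
#include <emmintrin.h>  // SSE2

// out[k] = i0[k] + i1[k] for k = 0, 1.
void add2(const double i0[2], const double i1[2], double out[2])
{
    __m128d a = _mm_loadu_pd(i0);  // i0[0] in the low lane, i0[1] in the high lane
    __m128d b = _mm_loadu_pd(i1);
    _mm_storeu_pd(out, _mm_add_pd(a, b));
}
```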

For XMM/YMM FP operation on Intel Haswell, can FMA be used in place of ADD?

Submitted by 大憨熊 on 2019-12-23 11:52:47
Question: This question is about packed, single-precision floating-point ops with XMM/YMM registers on Haswell. According to the awesome, awesome table put together by Agner Fog, I know that MUL can be done on either port p0 or p1 (with reciprocal throughput of 0.5), while ADD is done only on port p1 (with reciprocal throughput of 1). I can accept this limitation, BUT I also know that FMA can be done on either port p0 or p1 (with reciprocal throughput of 0.5). So it is confusing to me why a plain ADD would be limited to only …
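Not in the question, but as an aside: an FMA with a multiplier of 1.0 computes the same result as an add and can issue on either FMA port on Haswell, at the cost of FMA's longer latency (5 cycles versus 3 for addps). A minimal sketch, assuming FMA is enabled (e.g. -mfma):

```cpp
#include <immintrin.h>

// a + b expressed as a * 1.0f + b, so it can go to port 0 or port 1.
__m256 add_via_fma(__m256 a, __m256 b)
{
    const __m256 ones = _mm256_set1_ps(1.0f);
    return _mm256_fmadd_ps(a, ones, b);
}
```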

implications of using _mm_shuffle_ps on integer vector

Submitted by 拜拜、爱过 on 2019-12-23 11:46:50
Question: SSE intrinsics include _mm_shuffle_ps(xmm1, xmm2, imm8), which allows one to pick 2 elements from xmm1 concatenated with 2 elements from xmm2. However, this is for floats (implied by the _ps, packed single). If you cast your packed integers (__m128i), then you can use _mm_shuffle_ps as well: #include <iostream> #include <immintrin.h> #include <sstream> using namespace std; template <typename T> std::string __m128i_toString(const __m128i var) { std::stringstream sstr; const T* values = …
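Not part of the question, but a small self-contained sketch of the cast-based shuffle it describes: _mm_castsi128_ps and _mm_castps_si128 generate no instructions, so only shufps is emitted and the integer bit patterns pass through unchanged; the usual concern is a possible bypass delay when mixing integer and floating-point domains on some microarchitectures.

```cpp
#include <immintrin.h>

// Shuffle 32-bit integer lanes with shufps via bit-casts.
// Result lanes (low to high): a[1], a[0], b[3], b[2].
__m128i shuffle_ints(__m128i a, __m128i b)
{
    __m128 af = _mm_castsi128_ps(a);
    __m128 bf = _mm_castsi128_ps(b);
    __m128 r  = _mm_shuffle_ps(af, bf, _MM_SHUFFLE(2, 3, 0, 1));
    return _mm_castps_si128(r);
}
```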

Work around lack of Yz machine constraint under Clang?

Submitted by 旧巷老猫 on 2019-12-23 10:10:52
Question: We use inline assembly to make SHA instructions available when __SHA__ is not defined. Under GCC we use: GCC_INLINE __m128i GCC_INLINE_ATTRIB MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c) { asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "Yz"(c)); return a; } Clang does not accept GCC's Yz constraint (see Clang 3.2 Issue 13199 and Clang 3.9 Issue 32727), which is required by the sha256rnds2 instruction: Yz means the first SSE register (%xmm0). We added a mov for Clang: asm ("mov …
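Not from the question itself, but one possible shape of the mov-based workaround it alludes to, as a sketch: load c into %xmm0 inside the asm and list xmm0 as a clobber so the compiler keeps the operands out of that register. The GCC_INLINE macros are dropped here for brevity.

```cpp
#include <immintrin.h>

// sha256rnds2 requires its third operand in %xmm0; without Clang support for
// the "Yz" constraint, move c there explicitly inside the asm.
static inline __m128i MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c)
{
    asm ("movdqa %2, %%xmm0        \n\t"
         "sha256rnds2 %%xmm0, %1, %0"
         : "+x"(a)
         : "x"(b), "x"(c)
         : "xmm0");   // clobber keeps a, b, c out of %xmm0
    return a;
}
```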

Forcing AVX intrinsics to use SSE instructions instead

Submitted by ぃ、小莉子 on 2019-12-23 09:26:32
Question: Unfortunately I have an AMD Piledriver CPU, which seems to have problems with AVX instructions: "Memory writes with the 256-bit AVX registers are exceptionally slow. The measured throughput is 5 - 6 times slower than on the previous model (Bulldozer), and 8 - 9 times slower than two 128-bit writes." In my own experience, I've found _mm256 intrinsics to be much slower than the 128-bit _mm ones, and I'm assuming it's because of the above. I really want to code for the newest instruction set, AVX, though, …
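Not part of the question, but since the quoted problem is specifically slow 256-bit stores, one commonly suggested middle ground is to keep 256-bit arithmetic and split only the store into two 128-bit halves. A minimal sketch (the function name is illustrative):

```cpp
#include <immintrin.h>

// Store a 256-bit vector as two 128-bit stores.
static inline void store256_as_two_128(float* dst, __m256 v)
{
    _mm_storeu_ps(dst,     _mm256_castps256_ps128(v));   // low half (the cast is free)
    _mm_storeu_ps(dst + 4, _mm256_extractf128_ps(v, 1)); // high half
}
```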

Segmentation fault while working with SSE intrinsics due to incorrect memory alignment

Submitted by 非 Y 不嫁゛ on 2019-12-23 09:23:41
Question: I am working with SSE intrinsics for the first time and I am encountering a segmentation fault even after ensuring 16-byte memory alignment. This post is an extension of my earlier question: How to allocate 16byte memory aligned data. This is how I have declared my array: float *V = (float*) memalign(16, dx*sizeof(float)); When I try to do this: __m128 v_i = _mm_load_ps(&V[i]); // it works. But when I do this: __m128 u1 = _mm_load_ps(&V[(i-1)]); // there is a segmentation fault. But if I do: __m128 …
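Not in the question, but assuming the root cause is that &V[i-1] is not a multiple of 16 bytes (V itself is 16-byte aligned, so &V[i] is aligned only when i is a multiple of 4), the usual fix is an unaligned load; a minimal sketch with an illustrative helper name:

```cpp
#include <xmmintrin.h>

// _mm_loadu_ps tolerates any alignment, unlike _mm_load_ps.
__m128 load_window(const float* V, int i)
{
    return _mm_loadu_ps(&V[i - 1]);
}
```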

Best way to shuffle 64-bit portions of two __m128i's

Submitted by 我们两清 on 2019-12-23 07:49:54
Question: I have two __m128i s, a and b, that I want to shuffle so that the upper 64 bits of a fall in the lower 64 bits of dst and the lower 64 bits of b fall in the upper 64 bits of dst, i.e. dst[0:63] = a[64:127], dst[64:127] = b[0:63]. Equivalent to: __m128i dst = _mm_unpacklo_epi64(_mm_srli_si128(a, 8), b); or __m128i dst = _mm_castpd_si128(_mm_shuffle_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b), 1)); Is there a better way to do this than the first method? The second one is just one instruction, but the …
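Not part of the question, but for comparison, a single SSSE3 palignr also yields the requested layout (dst[0:63] = a[64:127], dst[64:127] = b[0:63]); a minimal sketch:

```cpp
#include <tmmintrin.h>  // SSSE3

// Take bytes 8..23 of the b:a concatenation: high half of a, then low half of b.
__m128i combine_halves(__m128i a, __m128i b)
{
    return _mm_alignr_epi8(b, a, 8);
}
```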

Is there still any development on SIMD in Mono?

Submitted by 丶灬走出姿态 on 2019-12-23 07:46:03
Question: I want to know whether there has been, or is, any development on Mono.SIMD (or SIMD support in general inside Mono) since it came out 5(!) years ago. I personally think this was a great step toward improving speed for C#. However, I've used it for some time now and I feel that Mono.SIMD is falling behind, as lots of functions are missing. Some of the problems I'm facing include: the lack of a dot product, which can be implemented in 1 operation ever since SSE4.1 (which came out in 2006 and is now …
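Not part of the question, but for reference, the single-instruction dot product it refers to is SSE4.1's dpps; a minimal sketch in C intrinsics (this is the operation the question says Mono.SIMD lacked at the time):

```cpp
#include <smmintrin.h>  // SSE4.1

// Dot product of two 4-float vectors: mask 0xF1 multiplies all four lane
// pairs and writes the sum to lane 0 only.
float dot4(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}
```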