sse2

SSE2 test xmm bitmask directly without using 'pmovmskb'

a 夏天 submitted on 2021-02-11 16:41:40
Question: Consider the following: .... pxor xmm1, xmm1 movdqu xmm0, [rax] pcmpeqb xmm0, xmm1 pmovmskb eax, xmm0 test ax, ax jz .zero ... Is there any way to skip 'pmovmskb' and test the comparison result directly in xmm0 (to check whether it is all zero)? Is there an SSE instruction for that? In effect, I'm looking for something like 'ptest xmm0, xmm0', but in SSE2, not SSE4. Answer 1: It's generally not worth using SSE4.1 ptest xmm0,xmm0 on a pcmpeqb result, especially not if you're branching. pmovmskb
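For reference, a minimal C-intrinsics restatement of the check in the question (the function name has_zero_byte and the standalone framing are mine, not the asker's):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>

    /* Returns nonzero if any of the 16 bytes at p is zero.
       Mirrors the asm above: movdqu load, pcmpeqb against zero, then
       pmovmskb (_mm_movemask_epi8) to get a 16-bit mask the CPU can
       test and branch on. */
    static int has_zero_byte(const uint8_t *p)
    {
        __m128i v   = _mm_loadu_si128((const __m128i *)p);      /* movdqu  */
        __m128i cmp = _mm_cmpeq_epi8(v, _mm_setzero_si128());   /* pcmpeqb */
        return _mm_movemask_epi8(cmp) != 0;                     /* pmovmskb + test */
    }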

Performance optimisations of x86-64 assembly - Alignment and branch prediction

故事扮演 submitted on 2021-02-08 19:50:37
Question: I'm currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc., using x86-64 assembly with SSE2 instructions. So far I've managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise further. For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps, completely degrades the overall performance. And there's absolutely
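The question refers to hand-written asm that is not reproduced here; as context only, a hypothetical C-intrinsics sketch of the kind of SSE2 strlen loop being discussed (the name sse2_strlen is made up, and a real implementation must first align the pointer so the 16-byte loads cannot cross into an unmapped page):

    #include <emmintrin.h>  /* SSE2 */
    #include <stddef.h>

    static size_t sse2_strlen(const char *s)
    {
        const char *p = s;
        const __m128i zero = _mm_setzero_si128();
        for (;;) {
            __m128i v   = _mm_loadu_si128((const __m128i *)p);
            int     msk = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
            if (msk)   /* some byte in this 16-byte chunk was 0 */
                return (size_t)(p - s) + (size_t)__builtin_ctz(msk);  /* GCC/Clang builtin */
            p += 16;
        }
    }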

How to simulate pcmpgtq on sse2?

寵の児 submitted on 2021-02-05 05:13:17
Question: PCMPGTQ was introduced in SSE4.2; it provides a signed greater-than comparison on 64-bit numbers that yields a mask. How does one support this functionality on instruction sets predating SSE4.2? Update: The same question applies to ARMv7 with NEON, which also lacks a 64-bit comparator. The sister question is found here: What is the most efficient way to support CMGT with 64bit signed comparisons on ARMv7a with Neon? Answer 1: __m128i pcmpgtq_sse2 (__m128i a, __m128i b) { __m128i r =
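The answer's code is cut off above; what follows is a sketch of one well-known SSE2 emulation along the same lines (my reconstruction, not necessarily the answer's exact code):

    #include <emmintrin.h>  /* SSE2 */

    /* Signed 64-bit a > b per element, SSE2 only.
       If the high 32-bit halves are equal, the sign of the 64-bit
       difference b - a decides the result (no overflow is possible when
       the high halves agree); otherwise the signed 32-bit compare of the
       high halves decides it. Finally broadcast each element's high-lane
       sign bit across all 64 bits to form the mask. */
    static __m128i pcmpgtq_sse2(__m128i a, __m128i b)
    {
        __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a));
        r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b));
        r = _mm_srai_epi32(r, 31);               /* sign bit -> full 32-bit mask */
        return _mm_shuffle_epi32(r, _MM_SHUFFLE(3, 3, 1, 1));
    }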

SSE2 double multiplication slower than with standard multiplication

感情迁移 submitted on 2021-01-29 09:42:28
Question: I'm wondering why the following code using SSE2 instructions performs the multiplication more slowly than the standard C++ implementation. Here is the code: m_win = (double*)_aligned_malloc(size*sizeof(double), 16); __m128d* pData = (__m128d*)input().data; __m128d* pWin = (__m128d*)m_win; __m128d* pOut = (__m128d*)m_output.data; __m128d tmp; int i=0; for(; i<m_size/2; i++) pOut[i] = _mm_mul_pd(pData[i], pWin[i]); The memory for m_output.data and input().data has been allocated with _aligned_malloc.
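For readers without the surrounding class, here is a standalone version of the loop in the question (names follow the question; all three pointers must genuinely be 16-byte aligned for these direct __m128d accesses to be safe). A loop this simple does very little arithmetic per byte loaded, so it is often limited by loads and stores rather than by the multiplies, which is one common reason a hand-vectorised version fails to beat scalar code:

    #include <emmintrin.h>  /* SSE2 */

    /* out[i] = in[i] * win[i] for 'size' doubles, two at a time.
       Assumes 16-byte-aligned pointers and an even 'size'. */
    static void mul_pd_arrays(double *out, const double *in,
                              const double *win, int size)
    {
        __m128d       *pOut  = (__m128d *)out;
        const __m128d *pData = (const __m128d *)in;
        const __m128d *pWin  = (const __m128d *)win;
        for (int i = 0; i < size / 2; i++)
            pOut[i] = _mm_mul_pd(pData[i], pWin[i]);
    }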

What is the point of SSE2 instructions such as orpd?

橙三吉。 submitted on 2020-07-30 06:04:50
Question: The orpd instruction is a "bitwise logical OR of packed double-precision floating-point values". Doesn't it do exactly the same thing as por ("bitwise logical OR")? If so, what's the point of having it? Answer 1: Remember that SSE1 orps came first. (Well, actually MMX por mm, mm/mem came even before SSE1.) Having the same opcode with a new prefix be the SSE2 orpd instruction makes sense for hardware decoder logic, I guess, just like movapd vs. movaps. Several instructions like this are redundant
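As a concrete illustration of the redundancy, the three OR flavours compute identical bit patterns; they differ only in which execution domain (integer vs. floating point) the CPU assigns them to, which can cost an extra bypass-latency cycle when a result crosses domains on some microarchitectures:

    #include <emmintrin.h>  /* SSE2 (also pulls in SSE1's xmmintrin.h) */

    /* Identical bitwise result, three different domains/encodings. */
    static __m128d or_pd(__m128d a, __m128d b) { return _mm_or_pd(a, b); }    /* orpd (SSE2) */
    static __m128  or_ps(__m128  a, __m128  b) { return _mm_or_ps(a, b); }    /* orps (SSE1) */
    static __m128i or_si(__m128i a, __m128i b) { return _mm_or_si128(a, b); } /* por  (SSE2) */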
