sse2

SSE2 test xmm bitmask directly without using 'pmovmskb'

a 夏天 submitted on 2021-02-11 16:41:40
Question: Consider the following: .... pxor xmm1, xmm1 movdqu xmm0, [rax] pcmpeqb xmm0, xmm1 pmovmskb eax, xmm0 test ax, ax jz .zero ... Is there any way to skip 'pmovmskb' and test the comparison result directly in xmm0 (to check whether it is all zero)? Is there an SSE instruction for that? In effect, I'm looking for something like 'ptest xmm0, xmm0', but in SSE2, not SSE4. Answer 1: It's generally not worth using SSE4.1 ptest xmm0,xmm0 on a pcmpeqb result, especially not if you're branching. pmovmskb
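For reference, a minimal C-intrinsics restatement of the check in the question (the function name has_zero_byte and the standalone framing are mine, not the asker's):

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>

    /* Returns nonzero if any of the 16 bytes at p is zero.
       Mirrors the asm above: movdqu load, pcmpeqb against zero, then
       pmovmskb (_mm_movemask_epi8) to get a 16-bit mask the CPU can
       test and branch on. */
    static int has_zero_byte(const uint8_t *p)
    {
        __m128i v   = _mm_loadu_si128((const __m128i *)p);      /* movdqu  */
        __m128i cmp = _mm_cmpeq_epi8(v, _mm_setzero_si128());   /* pcmpeqb */
        return _mm_movemask_epi8(cmp) != 0;                     /* pmovmskb + test */
    }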

Performance optimisations of x86-64 assembly - Alignment and branch prediction

故事扮演 submitted on 2021-02-08 19:50:37
Question: I'm currently coding highly optimised versions of some C99 standard library string functions, like strlen(), memset(), etc., using x86-64 assembly with SSE2 instructions. So far I've managed to get excellent results in terms of performance, but I sometimes get weird behaviour when I try to optimise further. For instance, adding or even removing some simple instructions, or simply reorganising some local labels used with jumps, completely degrades the overall performance. And there's absolutely
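The question refers to hand-written asm that is not reproduced here; as context only, a hypothetical C-intrinsics sketch of the kind of SSE2 strlen loop being discussed (the name sse2_strlen is made up, and a real implementation must first align the pointer so the 16-byte loads cannot cross into an unmapped page):

    #include <emmintrin.h>  /* SSE2 */
    #include <stddef.h>

    static size_t sse2_strlen(const char *s)
    {
        const char *p = s;
        const __m128i zero = _mm_setzero_si128();
        for (;;) {
            __m128i v   = _mm_loadu_si128((const __m128i *)p);
            int     msk = _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
            if (msk)   /* some byte in this 16-byte chunk was 0 */
                return (size_t)(p - s) + (size_t)__builtin_ctz(msk);  /* GCC/Clang builtin */
            p += 16;
        }
    }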

How to simulate pcmpgtq on sse2?

寵の児 submitted on 2021-02-05 05:13:17
Question: PCMPGTQ was introduced in SSE4.2; it provides a signed greater-than comparison on 64-bit numbers that yields a mask. How does one support this functionality on instruction sets predating SSE4.2? Update: The same question applies to ARMv7 with NEON, which also lacks a 64-bit comparator. The sister question is found here: What is the most efficient way to support CMGT with 64bit signed comparisons on ARMv7a with Neon? Answer 1: __m128i pcmpgtq_sse2 (__m128i a, __m128i b) { __m128i r =
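The answer's code is cut off above; what follows is a sketch of one well-known SSE2 emulation along the same lines (my reconstruction, not necessarily the answer's exact code):

    #include <emmintrin.h>  /* SSE2 */

    /* Signed 64-bit a > b per element, SSE2 only.
       If the high 32-bit halves are equal, the sign of the 64-bit
       difference b - a decides the result (no overflow is possible when
       the high halves agree); otherwise the signed 32-bit compare of the
       high halves decides it. Finally broadcast each element's high-lane
       sign bit across all 64 bits to form the mask. */
    static __m128i pcmpgtq_sse2(__m128i a, __m128i b)
    {
        __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a));
        r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b));
        r = _mm_srai_epi32(r, 31);               /* sign bit -> full 32-bit mask */
        return _mm_shuffle_epi32(r, _MM_SHUFFLE(3, 3, 1, 1));
    }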

SSE2 double multiplication slower than with standard multiplication

感情迁移 submitted on 2021-01-29 09:42:28
Question: I'm wondering why the following code using SSE2 instructions performs the multiplication more slowly than the standard C++ implementation. Here is the code: m_win = (double*)_aligned_malloc(size*sizeof(double), 16); __m128d* pData = (__m128d*)input().data; __m128d* pWin = (__m128d*)m_win; __m128d* pOut = (__m128d*)m_output.data; __m128d tmp; int i=0; for(; i<m_size/2; i++) pOut[i] = _mm_mul_pd(pData[i], pWin[i]); The memory for m_output.data and input().data has been allocated with _aligned_malloc.
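For readers without the surrounding class, here is a standalone version of the loop in the question (names follow the question; all three pointers must genuinely be 16-byte aligned for these direct __m128d accesses to be safe). A loop this simple does very little arithmetic per byte loaded, so it is often limited by loads and stores rather than by the multiplies, which is one common reason a hand-vectorised version fails to beat scalar code:

    #include <emmintrin.h>  /* SSE2 */

    /* out[i] = in[i] * win[i] for 'size' doubles, two at a time.
       Assumes 16-byte-aligned pointers and an even 'size'. */
    static void mul_pd_arrays(double *out, const double *in,
                              const double *win, int size)
    {
        __m128d       *pOut  = (__m128d *)out;
        const __m128d *pData = (const __m128d *)in;
        const __m128d *pWin  = (const __m128d *)win;
        for (int i = 0; i < size / 2; i++)
            pOut[i] = _mm_mul_pd(pData[i], pWin[i]);
    }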

What is the point of SSE2 instructions such as orpd?

橙三吉。 submitted on 2020-07-30 06:04:50
Question: The orpd instruction is a "bitwise logical OR of packed double-precision floating-point values". Doesn't it do exactly the same thing as por ("bitwise logical OR")? If so, what's the point of having it? Answer 1: Remember that SSE1 orps came first. (Well, actually MMX por mm, mm/mem came even before SSE1.) Having the same opcode with a new prefix be the SSE2 orpd instruction makes sense for hardware decoder logic, I guess, just like movapd vs. movaps. Several instructions like this are redundant
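As a concrete illustration of the redundancy, the three OR flavours compute identical bit patterns; they differ only in which execution domain (integer vs. floating point) the CPU assigns them to, which can cost an extra bypass-latency cycle when a result crosses domains on some microarchitectures:

    #include <emmintrin.h>  /* SSE2 (also pulls in SSE1's xmmintrin.h) */

    /* Identical bitwise result, three different domains/encodings. */
    static __m128d or_pd(__m128d a, __m128d b) { return _mm_or_pd(a, b); }    /* orpd (SSE2) */
    static __m128  or_ps(__m128  a, __m128  b) { return _mm_or_ps(a, b); }    /* orps (SSE1) */
    static __m128i or_si(__m128i a, __m128i b) { return _mm_or_si128(a, b); } /* por  (SSE2) */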
