sse

What is the point of SSE2 instructions such as orpd?

Submitted by 我只是一个虾纸丫 on 2020-07-30 06:04:04
Question: The orpd instruction is a "bitwise logical OR of packed double-precision floating-point values". Doesn't this do exactly the same thing as por ("bitwise logical OR")? If so, what's the point of having it?

Answer 1: Remember that SSE1 orps came first. (Well, actually MMX por mm, mm/mem came even before SSE1.) Having the same opcode with a new prefix be the SSE2 orpd instruction makes sense for the hardware decoder logic, I guess, just like movapd vs. movaps. Several instructions like this are redundant…
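For illustration (my own sketch, not part of the original answer), the C snippet below shows that _mm_or_pd (orpd) and _mm_or_si128 (por) produce exactly the same bit pattern; the instruction choice only hints to the CPU which execution domain (FP or integer) the data belongs to:

    /* Illustration only: orpd and por compute the same bitwise OR. */
    #include <immintrin.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        __m128d a = _mm_set_pd(1.5, -2.0);
        __m128d b = _mm_set_pd(0.25, 8.0);

        __m128d or_pd = _mm_or_pd(a, b);                   /* typically compiles to orpd */
        __m128i or_si = _mm_or_si128(_mm_castpd_si128(a),  /* typically compiles to por  */
                                     _mm_castpd_si128(b));

        /* Both results hold exactly the same 128-bit pattern. */
        printf("same bits: %d\n", memcmp(&or_pd, &or_si, sizeof or_pd) == 0);
        return 0;
    }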

Is there any situation where using MOVDQU and MOVUPD is better than MOVUPS?

Submitted by 眉间皱痕 on 2020-07-29 12:08:44
Question: I was trying to understand the different MOV instructions for SSE on Intel x86-64. According to this, you should use the aligned instructions (MOVAPS, MOVAPD and MOVDQA) when moving data between two registers, using the correct one for the type you're operating with, and use MOVUPS/MOVAPS when moving between a register and memory, since the type does not impact performance when moving to/from memory. So is there ever any reason to use MOVDQU and MOVUPD? Is the explanation I got from the link wrong?
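As an illustration (my own, not from the question), the three unaligned-load intrinsics below all read the same 16 bytes; they differ only in which instruction the compiler emits (movups, movupd or movdqu) and therefore in which execution domain the loaded value is tagged:

    /* Illustration only: three intrinsics, three encodings, same 16 bytes. */
    #include <immintrin.h>

    void load_three_ways(const void *p, __m128 *f, __m128d *d, __m128i *i) {
        *f = _mm_loadu_ps((const float *)p);       /* movups */
        *d = _mm_loadu_pd((const double *)p);      /* movupd */
        *i = _mm_loadu_si128((const __m128i *)p);  /* movdqu */
    }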

Do all CPUs which support AVX2 also support SSE4.2 and AVX?

Submitted by 自作多情 on 2020-07-29 12:06:11
Question: I am planning to implement runtime detection of SIMD extensions. If I find out that the processor has AVX2 support, is it also guaranteed to have SSE4.2 and AVX support?

Answer 1: Support for a more recent Intel SIMD ISA extension implies support for the previous ones. AVX2 definitely implies AVX1. I think AVX1 implies that all of the SSE/SSE2/SSE3/SSSE3/SSE4.1/SSE4.2 feature bits must also be set in CPUID. If not formally guaranteed, many things make this assumption, and a CPU that…
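A minimal detection sketch (assuming a GCC or Clang toolchain; MSVC would need __cpuidex instead) using the compiler builtins:

    /* Illustration only: query individual feature bits at runtime. */
    #include <stdio.h>

    int main(void) {
        __builtin_cpu_init();  /* safe to call explicitly before the queries */
        printf("AVX2:   %d\n", __builtin_cpu_supports("avx2") != 0);
        printf("AVX:    %d\n", __builtin_cpu_supports("avx") != 0);
        printf("SSE4.2: %d\n", __builtin_cpu_supports("sse4.2") != 0);
        return 0;
    }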

SSE Comparison Intrinsics - How to get 1 or 0 from a comparison?

Submitted by 你说的曾经没有我的故事 on 2020-07-21 04:52:44
Question: I am trying to write the equivalent of an if statement with SSE intrinsics. I am using __m128 _mm_cmplt_ps(__m128 a, __m128 b) to do the comparison a < b, and this returns 0xffffffff or 0x0 per element depending on whether the comparison was true or false. I would like to convert these values into 1 and 0. In order to do this, is it correct to use the logical "and" __m128 _mm_and_ps(__m128 c, __m128 d), where c is the result of the comparison and d is, e.g., 0xffffffff? Thank you for your attention.
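A minimal sketch of one common approach (my own illustration, not taken from the thread): AND the all-ones/all-zeros compare mask with a vector of 1.0f to get 1.0f or 0.0f per lane, or reinterpret the mask and AND it with integer 1 to get 1 or 0:

    /* Illustration only: turn a compare mask into 1/0 values. */
    #include <emmintrin.h>

    __m128 less_than_as_float01(__m128 a, __m128 b) {
        __m128 mask = _mm_cmplt_ps(a, b);            /* 0xFFFFFFFF or 0x0 per lane */
        return _mm_and_ps(mask, _mm_set1_ps(1.0f));  /* 1.0f or 0.0f per lane */
    }

    __m128i less_than_as_int01(__m128 a, __m128 b) {
        __m128i mask = _mm_castps_si128(_mm_cmplt_ps(a, b));
        return _mm_and_si128(mask, _mm_set1_epi32(1)); /* 1 or 0 per lane */
    }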

Interpreting GDB registers (SSE registers)

Submitted by 戏子无情 on 2020-07-15 15:30:04
Question: I've been using GDB for one day and I've accumulated a decent understanding of it. However, when I set a breakpoint at the final semicolon and print the registers, I can't fully interpret the meaning of the data stored in the XMM register. I don't know whether the data is in (MSB > LSB) order or vice versa.

    __m128i S = _mm_load_si128((__m128i*)Array16Bytes);
    }

So this is the result that I'm getting:

    (gdb) print $xmm0
    $1 = { v4_float = {1.2593182e-07, -4.1251766e-18, -5.43431603e-31, -2…
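To make the lane order explicit, one option (assuming GDB's standard $xmm sub-fields) is to print the register in several views; element [0] of each vN_... view is the least-significant lane, i.e. the first byte loaded from Array16Bytes:

    (gdb) p/x $xmm0.uint128
    (gdb) p/x $xmm0.v16_int8
    (gdb) p/x $xmm0.v4_int32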

SSE optimized emulation of 64-bit integers

Submitted by 我的未来我决定 on 2020-07-04 13:10:05
Question: For a hobby project I'm working on, I need to emulate certain 64-bit integer operations on an x86 CPU, and it needs to be fast. Currently I'm doing this via MMX instructions, but that's really a pain to work with, because I have to flush the FP register state all the time (and because most MMX instructions deal with signed integers, and I need unsigned behavior). So I'm wondering if the SSE/optimization gurus here on SO can come up with a better implementation using SSE. The operations I…
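A minimal sketch (my own, assuming the needed operations include 64-bit add, subtract and unsigned compare): SSE2 already provides paddq/psubq on two 64-bit lanes per register and needs no EMMS, unlike MMX; an unsigned compare can be built by flipping the sign bit of both operands before a signed compare (the signed 64-bit compare itself needs SSE4.2):

    /* Illustration only: 64-bit lane operations without MMX. */
    #include <immintrin.h>
    #include <stdint.h>

    __m128i add_u64x2(__m128i a, __m128i b) { return _mm_add_epi64(a, b); }  /* paddq */
    __m128i sub_u64x2(__m128i a, __m128i b) { return _mm_sub_epi64(a, b); }  /* psubq */

    /* a > b, unsigned, per 64-bit lane: flip the sign bit of both operands so
     * that a signed compare (pcmpgtq, SSE4.2) yields the unsigned ordering. */
    __m128i cmpgt_u64x2(__m128i a, __m128i b) {
        const __m128i bias = _mm_set1_epi64x(INT64_MIN);
        return _mm_cmpgt_epi64(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
    }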

sqrt of uint64_t vs. int64_t

Submitted by 有些话、适合烂在心里 on 2020-06-27 07:26:54
Question: I noticed that calculating the integer part of the square root of a uint64_t is much more complicated than of an int64_t. Does anybody have an explanation for this? Why is it seemingly so much more difficult to deal with one extra bit? The following:

    int64_t sqrt_int(int64_t a) { return sqrt(a); }

compiles with clang 5.0 and -mfpmath=sse -msse3 -Wall -O3 to:

    sqrt_int(long):                  # @sqrt_int(long)
        cvtsi2sd  xmm0, rdi
        sqrtsd    xmm0, xmm0
        cvttsd2si rax, xmm0
        ret

But the following:

    uint64_t sqrt_int(uint64_t…
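The crux (my own explanatory sketch, not from the thread): SSE2 only has a signed 64-bit-to-double convert (cvtsi2sd), so a uint64_t with its top bit set has to be handled with extra code before the conversion, roughly equivalent to the following C:

    /* Illustration only: convert uint64_t to double when only a signed
     * 64-bit convert instruction exists. */
    #include <stdint.h>

    double u64_to_double(uint64_t x) {
        if ((int64_t)x >= 0)                 /* top bit clear: signed convert works */
            return (double)(int64_t)x;
        /* Top bit set: halve the value (keeping the lost bit for rounding) so it
         * fits in int64_t, convert, then double the result. */
        uint64_t half = (x >> 1) | (x & 1);
        return (double)(int64_t)half * 2.0;
    }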
