sse

Golang assembly implementation of _mm_add_epi32

一个人想着一个人 submitted on 2020-08-10 13:10:13
Question: I'm trying to implement _mm_add_epi32 in Golang assembly, optionally with the help of avo. But I know little about assembly and do not even know how to start. Can you give me some hints? Thank you all. Here's the equivalent, slower Golang version:

    func add(x, y []uint32) []uint32 {
        if len(x) != len(y) {
            return nil
        }
        result := make([]uint32, len(x))
        for i := 0; i < len(x); i++ {
            result[i] = x[i] + y[i]
        }
        return result
    }

I know that the instruction paddq xmm, xmm is what we need, but do not
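
For reference, `_mm_add_epi32` adds four 32-bit lanes independently, each wrapping modulo 2^32; the SSE2 instruction behind it is `paddd`, while `paddq` is the 64-bit-lane form used by `_mm_add_epi64`. The following is a minimal pure-Go sketch of those semantics, not assembly and not avo output. The names `add4` and `addBlocks` are hypothetical, chosen for illustration. It shows the shape an assembly kernel would probably need: one four-lane step per 128-bit register, plus a scalar loop for leftover elements.

```go
package main

import "fmt"

// add4 models one PADDD step: four independent 32-bit lanes,
// each wrapping modulo 2^32 (no carry between lanes).
func add4(x, y [4]uint32) [4]uint32 {
	var r [4]uint32
	for i := range r {
		r[i] = x[i] + y[i]
	}
	return r
}

// addBlocks walks the slices four elements at a time, the way a
// SIMD loop would, then finishes the leftover elements one by one.
func addBlocks(x, y []uint32) []uint32 {
	if len(x) != len(y) {
		return nil
	}
	result := make([]uint32, len(x))
	i := 0
	for ; i+4 <= len(x); i += 4 {
		var a, b [4]uint32
		copy(a[:], x[i:i+4])
		copy(b[:], y[i:i+4])
		r := add4(a, b)
		copy(result[i:], r[:])
	}
	// Scalar tail for lengths that are not a multiple of four.
	for ; i < len(x); i++ {
		result[i] = x[i] + y[i]
	}
	return result
}

func main() {
	x := []uint32{1, 2, 3, 0xFFFFFFFF, 5}
	y := []uint32{10, 20, 30, 1, 50}
	fmt.Println(addBlocks(x, y)) // prints [11 22 33 0 55]
}
```

The fourth element wraps to 0, which is the same behavior as the plain Go `uint32` addition in the question, so a `paddd`-based kernel can be checked against the scalar version element by element.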

Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?

99封情书 submitted on 2020-08-05 04:47:31
Question: I was messing around with optimizing a function using Google Benchmark, and ran into a situation where my code was unexpectedly slowing down in certain situations. I started experimenting with it, looking at the compiled assembly, and eventually came up with a minimal test case that exhibits the issue. Here's the assembly I came up with that exhibits this slowdown:

    .text
    test:
        #xorps %xmm0, %xmm0
        cvtsi2ss %edi, %xmm0
        addss %xmm0, %xmm0
        addss %xmm0, %xmm0
        addss %xmm0, %xmm0
        addss %xmm0, %xmm0

What is the difference between non-packed and packed instruction in the context of SIMD-operations?

↘锁芯ラ submitted on 2020-08-04 18:50:58
Question: What is the difference between non-packed and packed instructions in the context of SIMD operations? I was reading an article on optimizing your code for SSE: http://www.cortstratton.org/articles/OptimizingForSSE.php#batch and this question arose when I read "As an added bonus, movss is a non-packed instruction, which allows us to make better use of the parallel instruction decoders." So what is the difference? Answer 1: To my understanding, packed means that conceptually more than one value is
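
To make the packed/non-packed distinction concrete, here is a small Go model of the two SSE add forms. It is a sketch of the semantics only, not real SIMD code. `vec4`, `addPacked` and `addScalar` are hypothetical names standing in for an XMM register, `addps` and `addss` respectively.

```go
package main

import "fmt"

// vec4 stands in for one 128-bit XMM register holding four floats.
type vec4 [4]float32

// addPacked models ADDPS: every lane of a is added to the
// corresponding lane of b.
func addPacked(a, b vec4) vec4 {
	var r vec4
	for i := range r {
		r[i] = a[i] + b[i]
	}
	return r
}

// addScalar models ADDSS: only lane 0 is added; lanes 1..3 are
// carried over unchanged from the destination operand a.
func addScalar(a, b vec4) vec4 {
	r := a
	r[0] = a[0] + b[0]
	return r
}

func main() {
	a := vec4{1, 2, 3, 4}
	b := vec4{10, 20, 30, 40}
	fmt.Println(addPacked(a, b)) // prints [11 22 33 44]
	fmt.Println(addScalar(a, b)) // prints [11 2 3 4]
}
```

The scalar form still reads and writes a full register, which is why its upper lanes pass through from the destination rather than being cleared.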

What is the point of SSE2 instructions such as orpd?

橙三吉。 submitted on 2020-07-30 06:04:50
Question: The orpd instruction is a "bitwise logical OR of packed double precision floating point values". Doesn't this do exactly the same thing as por ("bitwise logical OR")? If so, what's the point of having it? Answer 1: Remember that SSE1 orps came first. (Well, actually MMX por mm, mm/mem came even before SSE1.) Having the same opcode with a new prefix be the SSE2 orpd instruction makes sense for hardware decoder logic, I guess, just like movapd vs. movaps. Several instructions like this are redundant
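
The point that these instructions compute the same result can be illustrated with a short Go sketch. `or128` is a hypothetical helper that models a 128-bit register as two 64-bit halves. A bitwise OR does not care whether the bits are labelled integers or doubles, so one function stands in for `por`, `orps` and `orpd` alike.

```go
package main

import (
	"fmt"
	"math"
)

// or128 models the bitwise OR that POR, ORPS and ORPD all compute
// on a 128-bit register, represented here as two 64-bit halves.
func or128(a, b [2]uint64) [2]uint64 {
	return [2]uint64{a[0] | b[0], a[1] | b[1]}
}

func main() {
	// ORing in the sign bit forces each packed double negative,
	// a typical use of a bitwise OR on floating-point data.
	signMask := [2]uint64{1 << 63, 1 << 63}
	vals := [2]uint64{math.Float64bits(1.5), math.Float64bits(-2.0)}

	r := or128(vals, signMask)
	fmt.Println(math.Float64frombits(r[0]), math.Float64frombits(r[1])) // prints -1.5 -2
}
```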