sse

Choice between aligned vs. unaligned x86 SIMD instructions

早过忘川 submitted on 2019-12-10 03:29:07
Question: There are generally two types of SIMD instructions:

A. Ones that work with aligned memory addresses, and will raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:

    movaps xmm0, xmmword ptr [rax]
    vmovaps ymm0, ymmword ptr [rax]
    vmovaps zmm0, zmmword ptr [rax]

B. And the ones that work with unaligned memory addresses, and will not raise such an exception:

    movups xmm0, xmmword ptr [rax]
    vmovups ymm0, ymmword ptr [rax]
    vmovups zmm0, zmmword ptr [rax]

Is it okay to mix legacy SSE encoded instructions and VEX encoded ones in the same code path?

穿精又带淫゛_ submitted on 2019-12-10 02:14:19
Question: Along with the introduction of AVX, Intel introduced the VEX encoding scheme into the Intel 64 and IA-32 architecture. This encoding scheme is used mostly with AVX instructions. I was wondering if it's okay to intermix VEX-encoded instructions and the now so-called "legacy SSE" instructions. The main reason for my asking this question is code size. Consider these two instructions:

    shufps xmm0, xmm0, 0
    vshufps xmm0, xmm0, xmm0, 0

I commonly use the first one to "broadcast" a scalar value to all

Atomic operators, SSE/AVX, and OpenMP

本小妞迷上赌 submitted on 2019-12-09 22:57:01
Question: I'm wondering if SSE/AVX operations such as addition and multiplication can be atomic operations? The reason I ask is that in OpenMP the atomic construct only works on a limited set of operators; it does not work on, for example, SSE/AVX additions. Let's assume I had a datatype float4 that corresponds to an SSE register and that the addition operator is defined for float4 to do an SSE addition. In OpenMP I could do a reduction over an array with the following code: float4 sum4 = 0.0f; /

Shuffle even and odd values in SSE register

天大地大妈咪最大 submitted on 2019-12-09 18:36:06
Question: I load two SSE 128-bit registers with 16-bit values. The values are in the following order:

    src[0] = [E_3, O_3, E_2, O_2, E_1, O_1, E_0, O_0]
    src[1] = [E_7, O_7, E_6, O_6, E_5, O_5, E_4, O_4]

What I want to achieve is an order like this:

    src[0] = [E_7, E_6, E_5, E_4, E_3, E_2, E_1, E_0]
    src[1] = [O_7, O_6, O_5, O_4, O_3, O_2, O_1, O_0]

Do you know if there is a good way to do this (using SSE intrinsics up to SSE 4.2)? I'm stuck at the moment, because I can't shuffle 16 bit values between

SIMD (SSE) instruction for division in GCC

痴心易碎 submitted on 2019-12-09 18:27:45
Question: I'd like to optimize the following snippet using SSE instructions if possible:

    /*
     * the data structure
     */
    typedef struct v3d v3d;
    struct v3d {
        double x;
        double y;
        double z;
    } tmp = { 1.0, 2.0, 3.0 };

    /*
     * the part that should be "optimized"
     */
    tmp.x /= 4.0;
    tmp.y /= 4.0;
    tmp.z /= 4.0;

Is this possible at all?

Answer 1: I've used the SIMD extensions under Windows, but not yet under Linux. That being said, you should be able to take advantage of the DIVPS SSE operation, which will divide a 4-float

SSE 4 popcount for 16 8-bit values?

时光总嘲笑我的痴心妄想 submitted on 2019-12-09 18:00:42
Question: I have the following code, which compiles with GCC using the flag -msse4, but the problem is that the popcount only gets the last four 8-bits of the converted __m128i type. Basically what I want is to count all 16 numbers inside the __m128i type, but I'm not sure what intrinsic function call to make after creating the variable popA. Somehow popA has to be converted into an integer that contains all 128 bits of information? I suppose there's _mm_cvtsi128_si64 and using a few shuffle few

SSE multiplication 16 x uint8_t

帅比萌擦擦* submitted on 2019-12-09 04:55:09
Question: I want to multiply, with SSE4, a __m128i object containing 16 unsigned 8-bit integers, but I could only find an intrinsic for multiplying 16-bit integers. Is there nothing such as _mm_mult_epi8?

Answer 1: There is no 8-bit multiplication in MMX/SSE/AVX. However, you can emulate an 8-bit multiplication intrinsic using 16-bit multiplication as follows:

    inline __m128i _mm_mullo_epi8(__m128i a, __m128i b)
    {
        __m128i zero = _mm_setzero_si128();
        __m128i Alo = _mm_cvtepu8_epi16(a);
        __m128i Ahi = _mm_unpackhi_epi8(a

Counting the number of leading zeros in a 128-bit integer

最后都变了- submitted on 2019-12-09 00:42:53
Question: How can I count the number of leading zeros in a 128-bit integer (uint128_t) efficiently? I know GCC's built-in functions:

    __builtin_clz, __builtin_clzl, __builtin_clzll
    __builtin_ffs, __builtin_ffsl, __builtin_ffsll

However, these functions only work with 32- and 64-bit integers. I also found some SSE instructions:

    __lzcnt16, __lzcnt, __lzcnt64

As you may guess, these only work with 16-, 32- and 64-bit integers. Is there any similar, efficient built-in functionality for 128-bit

Storing two x86 32 bit registers into 128 bit xmm register

拟墨画扇 submitted on 2019-12-08 23:13:20
Question: Is there any faster method to store two x86 32-bit registers in one 128-bit xmm register?

    movd xmm0, edx
    movd xmm1, eax
    pshufd xmm0, xmm0, $1
    por xmm0, xmm1

So if EAX is 0x12345678 and EDX is 0x87654321, the result in xmm0 must be 0x8765432112345678. Thanks

Answer 1: With SSE4.1 you can use movd xmm0, eax / pinsrd xmm0, edx, 1 and do it in 2 instructions. For older CPUs you can use 2 x movd and then punpckldq for a total of 3 instructions:

    movd xmm0, eax
    movd xmm1, edx
    punpckldq xmm0, xmm1

Answer 2: I

Optimizing SSE code

旧时模样 submitted on 2019-12-08 22:19:26
Question: I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster than the Java version (~20%). However, it's still not fast enough. Unfortunately my experience with optimizing C code is somewhat limited, so I would love to get some ideas on how to improve the current implementation. The inner