sse

Choice between aligned vs. unaligned x86 SIMD instructions

早过忘川 submitted on 2019-12-10 03:29:07
Question: There are generally two types of SIMD instructions:

A. Ones that work with aligned memory addresses, and will raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:

    movaps xmm0, xmmword ptr [rax]
    vmovaps ymm0, ymmword ptr [rax]
    vmovaps zmm0, zmmword ptr [rax]

B. And the ones that work with unaligned memory addresses, and will not raise such an exception:

    movups xmm0, xmmword ptr [rax]
    vmovups ymm0, ymmword ptr [rax]
    vmovups zmm0, zmmword ptr [rax]

Is it okay to mix legacy SSE encoded instructions and VEX encoded ones in the same code path?

穿精又带淫゛_ submitted on 2019-12-10 02:14:19
Question: Along with the introduction of AVX, Intel introduced the VEX encoding scheme into the Intel 64 and IA-32 architecture. This encoding scheme is used mostly with AVX instructions. I was wondering if it's okay to intermix VEX-encoded instructions and the now so-called "legacy SSE" instructions. The main reason for my asking this question is code size. Consider these two instructions:

    shufps xmm0, xmm0, 0
    vshufps xmm0, xmm0, xmm0, 0

I commonly use the first one to "broadcast" a scalar value to all

Atomic operators, SSE/AVX, and OpenMP

本小妞迷上赌 submitted on 2019-12-09 22:57:01
Question: I'm wondering if SSE/AVX operations such as addition and multiplication can be atomic operations? The reason I ask is that in OpenMP the atomic construct only works on a limited set of operators; it does not work on, for example, SSE/AVX additions. Let's assume I had a datatype float4 that corresponds to an SSE register and that the addition operator is defined for float4 to do an SSE addition. In OpenMP I could do a reduction over an array with the following code: float4 sum4 = 0.0f; /

Shuffle even and odd values in SSE register

天大地大妈咪最大 submitted on 2019-12-09 18:36:06
Question: I load two SSE 128-bit registers with 16-bit values. The values are in the following order:

    src[0] = [E_3, O_3, E_2, O_2, E_1, O_1, E_0, O_0]
    src[1] = [E_7, O_7, E_6, O_6, E_5, O_5, E_4, O_4]

What I want to achieve is an order like this:

    src[0] = [E_7, E_6, E_5, E_4, E_3, E_2, E_1, E_0]
    src[1] = [O_7, O_6, O_5, O_4, O_3, O_2, O_1, O_0]

Do you know if there is a good way to do this (using SSE intrinsics up to SSE 4.2)? I'm stuck at the moment, because I can't shuffle 16 bit values between

SIMD (SSE) instruction for division in GCC

痴心易碎 submitted on 2019-12-09 18:27:45
Question: I'd like to optimize the following snippet using SSE instructions if possible:

    /*
     * the data structure
     */
    typedef struct v3d v3d;
    struct v3d {
        double x;
        double y;
        double z;
    } tmp = { 1.0, 2.0, 3.0 };

    /*
     * the part that should be "optimized"
     */
    tmp.x /= 4.0;
    tmp.y /= 4.0;
    tmp.z /= 4.0;

Is this possible at all?

Answer 1: I've used the SIMD extensions under Windows, but not yet under Linux. That being said, you should be able to take advantage of the DIVPS SSE operation, which will divide a 4-float

SSE 4 popcount for 16 8-bit values?

时光总嘲笑我的痴心妄想 submitted on 2019-12-09 18:00:42
Question: I have the following code, which compiles with GCC using the flag -msse4, but the problem is that the popcount only gets the last four 8-bits of the converted __m128i type. Basically what I want is to count all 16 numbers inside the __m128i type, but I'm not sure what intrinsic function call to make after creating the variable popA. Somehow popA has to be converted into an integer that contains all 128 bits of information? I suppose there's _mm_cvtsi128_si64 and using a few shuffle few

SSE multiplication 16 x uint8_t

帅比萌擦擦* submitted on 2019-12-09 04:55:09
Question: I want to multiply, with SSE4, a __m128i object containing 16 unsigned 8-bit integers, but I could only find an intrinsic for multiplying 16-bit integers. Is there nothing such as _mm_mult_epi8?

Answer 1: There is no 8-bit multiplication in MMX/SSE/AVX. However, you can emulate an 8-bit multiplication intrinsic using 16-bit multiplication as follows:

    inline __m128i _mm_mullo_epi8(__m128i a, __m128i b)
    {
        __m128i zero = _mm_setzero_si128();
        __m128i Alo = _mm_cvtepu8_epi16(a);
        __m128i Ahi = _mm_unpackhi_epi8(a

Counting the number of leading zeros in a 128-bit integer

最后都变了- submitted on 2019-12-09 00:42:53
Question: How can I count the number of leading zeros in a 128-bit integer (uint128_t) efficiently? I know GCC's built-in functions:

    __builtin_clz, __builtin_clzl, __builtin_clzll
    __builtin_ffs, __builtin_ffsl, __builtin_ffsll

However, these functions only work with 32- and 64-bit integers. I also found some SSE instructions:

    __lzcnt16, __lzcnt, __lzcnt64

As you may guess, these only work with 16-, 32- and 64-bit integers. Is there any similar, efficient built-in functionality for 128-bit

Storing two x86 32 bit registers into 128 bit xmm register

拟墨画扇 submitted on 2019-12-08 23:13:20
Question: Is there any faster method to store two x86 32-bit registers in one 128-bit xmm register?

    movd xmm0, edx
    movd xmm1, eax
    pshufd xmm0, xmm0, $1
    por xmm0, xmm1

So if EAX is 0x12345678 and EDX is 0x87654321, the result in xmm0 must be 0x8765432112345678. Thanks

Answer 1: With SSE4.1 you can use movd xmm0, eax / pinsrd xmm0, edx, 1 and do it in 2 instructions. For older CPUs you can use 2 x movd and then punpckldq for a total of 3 instructions:

    movd xmm0, eax
    movd xmm1, edx
    punpckldq xmm0, xmm1

Answer 2: I

Optimizing SSE code

旧时模样 submitted on 2019-12-08 22:19:26
Question: I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster than the Java version (~20%). However, it's still not fast enough. Unfortunately my experience with optimizing C code is somewhat limited, so I would love to get some ideas on how to improve the current implementation. The inner