SSE

Add a constant value to an xmm register in x86

Posted by 穿精又带淫゛_ on 2019-12-07 12:02:51
Question: How would I add 1 or 2 to the register xmm0 (double)? I can do it like this, but surely there must be an easier way:

    movsd xmm0, [ecx]
    xor eax, eax
    inc eax
    cvtsi2sd xmm1, eax
    addsd xmm0, xmm1
    movsd [ecx], xmm0

Also, would it be possible to do this with the x87 floating-point instructions? This doesn't work for me:

    fld dword ptr [ecx]
    fld1
    faddp
    fstp dword ptr [ecx]

Answer 1: You can keep the constant in memory or in another register:

    _1 dq 1.0

and

    addsd xmm1, [_1]

or

    movsd xmm0, [_1]
    addsd xmm1, xmm0

If …
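
For comparison, the same idea is a one-liner with C intrinsics; a minimal sketch (the function name is mine), where the compiler materializes the constant as a memory operand much like the _1 dq 1.0 above:

    #include <emmintrin.h>  // SSE2

    // Add 1.0 to the low double of a vector: a single addsd against a
    // compiler-generated memory constant.
    static __m128d add_one(__m128d x) {
        return _mm_add_sd(x, _mm_set_sd(1.0));
    }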

Performance of intrinsic functions with SSE

Posted by 廉价感情. on 2019-12-07 09:57:13
Question: I am currently getting started with SSE. The answer to my previous question regarding SSE (Multiplying vector by constant using SSE) gave me the idea to test the difference between using intrinsics like _mm_mul_ps() and just using "normal operators" (not sure what the best term is) like *. So I wrote two test cases which differ only in how the result is calculated. Method 1:

    int main(void){
        float4 a, b, c;
        a.v = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
        b.v = _mm_set_ps(-1.0f, -2.0f, -3…
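
For reference, a sketch of the two styles being compared; the float4 definition here is an assumption (the original post defines its own), using a union of __m128 with a GCC vector type, which is what makes the * operator legal:

    #include <xmmintrin.h>  // SSE

    typedef float v4sf __attribute__((vector_size(16)));
    typedef union { __m128 v; v4sf f; } float4;

    // Intrinsic form vs. operator form: a decent compiler emits the
    // same mulps for both, so measured differences usually come from
    // the benchmark harness rather than the multiply itself.
    static __m128 mul_intrinsic(float4 a, float4 b) { return _mm_mul_ps(a.v, b.v); }
    static v4sf   mul_operator (float4 a, float4 b) { return a.f * b.f; }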

Fastest method of vectorized integer division by non-constant divisor

Posted by 流过昼夜 on 2019-12-07 09:53:20
Question: Based on the answers/comments of this question, I wrote a performance test with gcc 4.9.2 (MinGW64) to estimate which way of doing multiple integer divisions is faster, as follows:

    #include <emmintrin.h> // SSE2

    static unsigned short x[8] = {0, 55, 2, 62003, 786, 5555, 123, 32111}; // Dividend

    __attribute__((noinline)) static void test_div_x86(unsigned i){
        for(; i; --i)
            x[0] /= i, x[1] /= i, x[2] /= i, x[3] /= i,
            x[4] /= i, x[5] /= i, x[6] /= i, x[7] /= i;
    }

    __attribute__((noinline)) static void …
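
Since SSE has no packed integer divide, the usual vectorized route widens to 32-bit, divides in float, and truncates. A minimal sketch (the function name is mine), assuming SSE4.1 for the final pack:

    #include <emmintrin.h>  // SSE2
    #include <smmintrin.h>  // SSE4.1, for _mm_packus_epi32

    // Divide eight uint16 lanes by one runtime divisor via float math.
    // A 16-bit dividend and divisor are both exactly representable in
    // float, so truncating the quotient matches integer division.
    static __m128i div_u16(__m128i x, unsigned short div) {
        __m128  d    = _mm_set1_ps((float)div);
        __m128i zero = _mm_setzero_si128();
        __m128i lo   = _mm_unpacklo_epi16(x, zero);   // zero-extend to 32-bit
        __m128i hi   = _mm_unpackhi_epi16(x, zero);
        __m128i qlo  = _mm_cvttps_epi32(_mm_div_ps(_mm_cvtepi32_ps(lo), d));
        __m128i qhi  = _mm_cvttps_epi32(_mm_div_ps(_mm_cvtepi32_ps(hi), d));
        return _mm_packus_epi32(qlo, qhi);            // SSE4.1
    }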

Determinant calculation with SIMD

Posted by 这一生的挚爱 on 2019-12-07 09:23:33
Question: Is there an approach for calculating the determinant of small matrices (around 4x4) that works well with SIMD (NEON, SSE, SSE2)? I am using a hand-expansion formula, which does not work so well. I am using SSE up through SSE3, and NEON, both under Linux. The matrix elements are all floats.

Answer 1: Here are my 5 cents.

Determinant of a 2x2 matrix: that's an exercise for the reader; it should be simple to implement.

Determinant of a 3x3 matrix: use the scalar triple product. This …
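
A sketch of the 3x3 scalar-triple-product route det = a . (b x c) with plain SSE (names mine), each row held in the low three lanes of a register with the w lane ignored:

    #include <xmmintrin.h>  // SSE

    static float det3x3(__m128 a, __m128 b, __m128 c) {
        // b x c via the classic (y z x) shuffle trick
        __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));
        __m128 c_yzx = _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));
        __m128 cross = _mm_sub_ps(_mm_mul_ps(b, c_yzx), _mm_mul_ps(b_yzx, c));
        cross = _mm_shuffle_ps(cross, cross, _MM_SHUFFLE(3, 0, 2, 1));
        // dot(a, cross) over the low three lanes, no SSE3 horizontal adds
        __m128 m = _mm_mul_ps(a, cross);
        __m128 s = _mm_add_ss(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1)));
        s = _mm_add_ss(s, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2)));
        return _mm_cvtss_f32(s);
    }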

Horizontal trailing maximum on AVX or SSE

Posted by 安稳与你 on 2019-12-07 07:07:03
Question: I have an __m256i register of 16-bit values, and I want every zero element to take the value of the nearest non-zero element before it. To give an example:

    input:  1 0 0 3 0 0 4 5 0 0 0 0 4 3 0 2
    output: 1 1 1 3 3 3 4 5 5 5 5 5 4 3 3 2

Is there an efficient way of doing this on SSE or AVX, maybe with log(16) = 4 iterations? Addition: any solution on 128-bit vectors holding 8 uint16's is appreciated as well.

Answer 1: You can do this in log_2(SIMD_width) steps indeed. The idea is to …
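
For the 128-bit / 8 x uint16 variant, a sketch of the log2(8) = 3-step propagation (names mine), assuming SSE4.1 for the blend. Note a plain running max would be wrong here: the smaller 4 after the run of 5s in the example has to survive, so each step instead pulls from the shifted copy only where the lane is still zero:

    #include <smmintrin.h>  // SSE4.1, for _mm_blendv_epi8

    static __m128i fill_zeros_u16(__m128i x) {
        __m128i zero = _mm_setzero_si128();
        __m128i m;
        m = _mm_cmpeq_epi16(x, zero);                      // lanes still zero
        x = _mm_blendv_epi8(x, _mm_slli_si128(x, 2), m);   // pull from 1 lane back
        m = _mm_cmpeq_epi16(x, zero);
        x = _mm_blendv_epi8(x, _mm_slli_si128(x, 4), m);   // 2 lanes back
        m = _mm_cmpeq_epi16(x, zero);
        x = _mm_blendv_epi8(x, _mm_slli_si128(x, 8), m);   // 4 lanes back
        return x;
    }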

Storing individual doubles from a packed double vector using Intel AVX

Posted by 空扰寡人 on 2019-12-07 05:59:28
Question: I'm writing code using the C intrinsics for Intel's AVX instructions. If I have a packed double vector (a __m256d), what would be the most efficient way (i.e., the fewest operations) to store each of its elements to a different place in memory (i.e., I need to fan them out to different locations so that they are no longer packed)? Pseudocode:

    __m256d *src;
    double *dst;
    int dst_dist;
    dst[0] = src[0];
    dst[dst_dist] = src[1];
    dst[2 * dst_dist] = src[2];
    dst[3 * dst_dist] = src[3];

Using SSE, I …
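
One low-operation approach, as a sketch mirroring the pseudocode's dst/dst_dist (the function name is mine): split the vector into its 128-bit halves, where the low half is free via a cast, and use the low/high store instructions:

    #include <immintrin.h>  // AVX

    static void store_scattered(__m256d v, double *dst, int dst_dist) {
        __m128d lo = _mm256_castpd256_pd128(v);    // elements 0 and 1
        __m128d hi = _mm256_extractf128_pd(v, 1);  // elements 2 and 3
        _mm_storel_pd(dst,                lo);
        _mm_storeh_pd(dst +     dst_dist, lo);
        _mm_storel_pd(dst + 2 * dst_dist, hi);
        _mm_storeh_pd(dst + 3 * dst_dist, hi);
    }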

Most efficient way to convert vector of uint32 to vector of float?

Posted by 爷,独闯天下 on 2019-12-07 05:38:33
Question: x86 does not have an SSE instruction to convert from unsigned int32 to floating point. What would be the most efficient instruction sequence for achieving this?

EDIT: To clarify, I want a vector version of the following scalar operation:

    unsigned int x = ...
    float res = (float)x;

EDIT2: Here is a naive algorithm for doing the scalar conversion:

    unsigned int x = ...
    float bias = 0.f;
    if (x > 0x7fffffff) {
        bias = (float)0x80000000;
        x -= 0x80000000;
    }
    res = signed_convert(x) + bias;

Answer 1: …
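
One standard trick, sketched with SSE2 intrinsics (the function name is mine): split each lane into 16-bit halves, convert both with the signed converter (they always fit), and recombine as hi * 2^16 + lo. Both partial results are exact, so the final add performs the only rounding step:

    #include <emmintrin.h>  // SSE2

    static __m128 cvt_u32_to_ps(__m128i v) {
        __m128i lo16 = _mm_and_si128(v, _mm_set1_epi32(0xFFFF));
        __m128i hi16 = _mm_srli_epi32(v, 16);
        __m128  hi   = _mm_mul_ps(_mm_cvtepi32_ps(hi16), _mm_set1_ps(65536.0f));
        return _mm_add_ps(hi, _mm_cvtepi32_ps(lo16));
    }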

Efficient (on Ryzen) way to extract the odd elements of a __m256 into a __m128?

Posted by £可爱£侵袭症+ on 2019-12-07 03:33:53
Question: Is there an intrinsic or another efficient way to repack the high/low 32-bit halves of the 64-bit elements of an AVX register into an SSE register? A solution using AVX2 is OK. So far I'm using the following code, but the profiler says it's slow on a Ryzen 1800X:

    // Global constant
    const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
    // ...
    // function code
    __m256i x = /* computed here */;
    const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(x), …
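
A sketch of a Ryzen-friendlier alternative for the float case (names mine): lane-crossing shuffles like vpermps/vpermd are comparatively expensive on Zen 1, while a 128-bit extract plus an in-lane shufps stays cheap. This pulls out the odd 32-bit elements 1, 3, 5, 7:

    #include <immintrin.h>  // AVX

    static __m128 odd_elements(__m256 x) {
        __m128 lo = _mm256_castps256_ps128(x);     // elements 0..3
        __m128 hi = _mm256_extractf128_ps(x, 1);   // elements 4..7
        return _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1));
    }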

What is the fastest way to do a SIMD gather without AVX(2)?

Posted by 我的未来我决定 on 2019-12-07 03:11:03
Question: Assuming I have SSE through SSE4.1, but not AVX(2), what is the fastest way to load a packed memory layout like this (all 32-bit integers):

    a0 b0 c0 d0 a1 b1 c1 d1 a2 b2 c2 d2 a3 b3 c3 d3

into four vectors a, b, c, d?

    a: {a0, a1, a2, a3}
    b: {b0, b1, b2, b3}
    c: {c0, c1, c2, c3}
    d: {d0, d1, d2, d3}

I'm not sure whether this is relevant, but in my actual application I have 16 vectors, and as such a0 and a1 are 16*4 bytes apart in memory.

Answer 1: What you need here is 4 loads followed by a 4x4 …
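
A sketch of that 4-loads-plus-4x4-transpose approach with SSE2 unpacks (names mine); stride is in 32-bit elements, so for the asker's 16-vector layout it would be 16 rather than 4:

    #include <emmintrin.h>  // SSE2
    #include <stddef.h>

    static void load_transposed(const int *p, ptrdiff_t stride,
                                __m128i *a, __m128i *b, __m128i *c, __m128i *d) {
        __m128i r0 = _mm_loadu_si128((const __m128i *)(p + 0 * stride));
        __m128i r1 = _mm_loadu_si128((const __m128i *)(p + 1 * stride));
        __m128i r2 = _mm_loadu_si128((const __m128i *)(p + 2 * stride));
        __m128i r3 = _mm_loadu_si128((const __m128i *)(p + 3 * stride));
        __m128i t0 = _mm_unpacklo_epi32(r0, r1);   // a0 a1 b0 b1
        __m128i t1 = _mm_unpacklo_epi32(r2, r3);   // a2 a3 b2 b3
        __m128i t2 = _mm_unpackhi_epi32(r0, r1);   // c0 c1 d0 d1
        __m128i t3 = _mm_unpackhi_epi32(r2, r3);   // c2 c3 d2 d3
        *a = _mm_unpacklo_epi64(t0, t1);
        *b = _mm_unpackhi_epi64(t0, t1);
        *c = _mm_unpacklo_epi64(t2, t3);
        *d = _mm_unpackhi_epi64(t2, t3);
    }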

Does using a mix of pxor and xorps affect performance?

Posted by ∥☆過路亽.° on 2019-12-07 03:07:58
Question: I've come across a fast CRC computation implemented with PCLMULQDQ. I see that its authors mix pxor and xorps instructions heavily, as in the fragment below:

    movdqa xmm10, [rk9]
    movdqa xmm8, xmm0
    pclmulqdq xmm0, xmm10, 0x11
    pclmulqdq xmm8, xmm10, 0x0
    pxor xmm7, xmm8
    xorps xmm7, xmm0

    movdqa xmm10, [rk11]
    movdqa xmm8, xmm1
    pclmulqdq xmm1, xmm10, 0x11
    pclmulqdq xmm8, xmm10, 0x0
    pxor xmm7, xmm8
    xorps xmm7, xmm1

Is there any practical reason for this? A performance boost? If yes, then what lies …