sse

Choosing SSE instruction execution domains in mixed contexts

Submitted by 夙愿已清 on 2021-02-07 19:43:33
Question: I am playing with a bit of SSE assembly code in which I do not have enough xmm registers to keep all the temporary results and useful constants in registers at the same time. As a workaround, for some constant vectors that have identical components, I "compress" several vectors into a single xmm register, xmm14 below. I use the pshufd instruction to decompress the constant vector I need. This instruction has some latency, but since it takes both a source and a destination register, it is

AVX/SSE round floats down and return vector of ints?

Submitted by 拟墨画扇 on 2021-02-07 08:20:53
Question: Is there a way using AVX/SSE to take a vector of floats, round them down, and produce a vector of ints? All the floor intrinsics seem to produce a final vector of floating point, which is odd because rounding produces an integer! Answer 1: SSE has conversion from FP to integer with your choice of truncation (towards zero) or the current rounding mode (normally the IEEE default, nearest with ties rounding to even), like nearbyint(), unlike round(), where the tiebreak is away from 0. If

How to make premultiplied alpha function faster using SIMD instructions?

Submitted by ↘锁芯ラ on 2021-02-07 06:38:12
Question: I'm looking for some SSE/AVX advice to optimize a routine that premultiplies the RGB channels by the alpha channel: RGB * alpha / 255 (and we keep the original alpha channel). for (int i = 0, max = width * height * 4; i < max; i+=4) { data[i] = static_cast<uint16_t>(data[i] * data[i+3]) / 255; data[i+1] = static_cast<uint16_t>(data[i+1] * data[i+3]) / 255; data[i+2] = static_cast<uint16_t>(data[i+2] * data[i+3]) / 255; } You will find my current implementation below, but I think it could be much

Matrix-Vector and Matrix-Matrix multiplication using SSE

Submitted by 懵懂的女人 on 2021-02-07 04:28:19
Question: I need to write matrix-vector and matrix-matrix multiplication functions, but I cannot wrap my head around SSE instructions. The dimensions of the matrices and vectors are always multiples of 4. I managed to write a vector-vector multiplication function that looks like this: void vector_multiplication_SSE(float* m, float* n, float* result, unsigned const int size) { int i; __declspec(align(16))__m128 *p_m = (__m128*)m; __declspec(align(16))__m128 *p_n = (__m128*)n; __declspec(align(16))__m128 *p

Use both SSE2 intrinsics and gcc inline assembler

Submitted by 北城以北 on 2021-02-07 02:49:32
Question: I have tried to mix SSE2 intrinsics and inline assembler in gcc, but if I specify a variable as xmm0/a register as input, then in some cases I get a compiler error. Example: #include <emmintrin.h> int main() { __m128i test = _mm_setzero_si128(); asm ("pxor %%xmm0, %%xmm0" : : "xmm0" (test) : ); } When compiling with gcc version 4.6.1 I get: >gcc asm_xmm.c asm_xmm.c: In function 'main': asm_xmm.c:10:3: error: matching constraint references invalid operand number asm_xmm.c:7:5: error: matching

I've some problems understanding how AVX shuffle intrinsics are working for 8 bits

Submitted by 倾然丶 夕夏残阳落幕 on 2021-02-05 11:51:07
Question: I'm trying to pack 16-bit data to 8 bits using _mm256_shuffle_epi8, but the result I get is not what I'm expecting. auto srcData = _mm256_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32); __m256i vperm = _mm256_setr_epi8( 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); auto result = _mm256_shuffle_epi8(srcData, vperm); I'm expecting

Expand the lower two 32-bit floats of an xmm register to the whole xmm register

Submitted by 南楼画角 on 2021-02-05 07:26:05
Question: What is the most efficient way in Intel x86 assembly to do the following operation (a, b are 32-bit floats): from xmm1: [-, -, a, b] to xmm1: [a, a, b, b]. I could not find any useful instructions. My idea is to copy a and b to other registers, then shift the xmm1 register by 4 bytes and move a or b into the lowest 4 bytes. Answer 1: You're looking for unpcklps xmm1, xmm1 (https://www.felixcloutier.com/x86/unpcklps) to interleave the low elements of a register with itself: low element ->

What does AT&T syntax do about ambiguity between other mnemonics and operand-size suffixes?

Submitted by 谁说我不能喝 on 2021-02-05 07:12:05
Question: In AT&T syntax, instructions often have to be suffixed with the appropriate operand size, with q for operations on 64-bit operands. However, MMX and SSE also have a movq instruction, where the q is part of the original Intel mnemonic and not an additional suffix. So how is this represented in AT&T? Is another q suffix needed, like movqq %mm1, %mm0 movqq %xmm1, %xmm0 or not? And if there are any other instructions that end like AT&T suffixes (like paddd , slld ), do they work the same way?