sse

Choosing SSE instruction execution domains in mixed contexts

Submitted by 夙愿已清 on 2021-02-07 19:43:33
Question: I am playing with a bit of SSE assembly code in which I do not have enough xmm registers to keep all the temporary results and useful constants in registers at the same time. As a workaround, for some constant vectors that have identical components, I "compress" several vectors into a single xmm register, xmm14 below. I use the pshufd instruction to decompress the constant vector I need. This instruction has some latency, but since it takes both a source and a destination register, it is

AVX/SSE round floats down and return vector of ints?

Submitted by 拟墨画扇 on 2021-02-07 08:20:53
Question: Is there a way using AVX/SSE to take a vector of floats, round them down, and produce a vector of ints? All the floor intrinsics seem to produce a final vector of floating point, which is odd because rounding produces an integer! Answer 1: SSE has conversion from FP to integer with your choice of truncation (towards zero) or the current rounding mode (normally the IEEE default, nearest with ties rounding to even), like nearbyint(), unlike round(), where the tiebreak is away from 0. If

How to make premultiplied alpha function faster using SIMD instructions?

Submitted by ↘锁芯ラ on 2021-02-07 06:38:12
Question: I'm looking for some SSE/AVX advice to optimize a routine that premultiplies the RGB channels by the alpha channel: RGB * alpha / 255 (and we keep the original alpha channel). for (int i = 0, max = width * height * 4; i < max; i+=4) { data[i] = static_cast<uint16_t>(data[i] * data[i+3]) / 255; data[i+1] = static_cast<uint16_t>(data[i+1] * data[i+3]) / 255; data[i+2] = static_cast<uint16_t>(data[i+2] * data[i+3]) / 255; } You will find my current implementation below, but I think it could be much

Matrix-Vector and Matrix-Matrix multiplication using SSE

Submitted by 懵懂的女人 on 2021-02-07 04:28:19
Question: I need to write matrix-vector and matrix-matrix multiplication functions, but I cannot wrap my head around SSE instructions. The dimensions of the matrices and vectors are always multiples of 4. I managed to write a vector-vector multiplication function that looks like this: void vector_multiplication_SSE(float* m, float* n, float* result, unsigned const int size) { int i; __declspec(align(16))__m128 *p_m = (__m128*)m; __declspec(align(16))__m128 *p_n = (__m128*)n; __declspec(align(16))__m128 *p

Use both SSE2 intrinsics and gcc inline assembler

Submitted by 北城以北 on 2021-02-07 02:49:32
Question: I have tried to mix SSE2 intrinsics and inline assembler in gcc, but if I specify a variable as xmm0/a register as input, then in some cases I get a compiler error. Example: #include <emmintrin.h> int main() { __m128i test = _mm_setzero_si128(); asm ("pxor %%xmm0, %%xmm0" : : "xmm0" (test) : ); } When compiling with gcc version 4.6.1 I get: >gcc asm_xmm.c asm_xmm.c: In function 'main': asm_xmm.c:10:3: error: matching constraint references invalid operand number asm_xmm.c:7:5: error: matching

I've some problems understanding how AVX shuffle intrinsics are working for 8 bits

Submitted by 倾然丶 夕夏残阳落幕 on 2021-02-05 11:51:07
Question: I'm trying to pack 16-bit data to 8 bits using _mm256_shuffle_epi8, but the result I get is not what I'm expecting. auto srcData = _mm256_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32); __m256i vperm = _mm256_setr_epi8( 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); auto result = _mm256_shuffle_epi8(srcData, vperm); I'm expecting

Expand the lower two 32-bit floats of an xmm register to the whole xmm register

Submitted by 南楼画角 on 2021-02-05 07:26:05
Question: What is the most efficient way in Intel x86 assembly to do the following operation (a, b are 32-bit floats): from xmm1: [-, -, a, b] to xmm1: [a, a, b, b]. I could not find any useful instructions. My idea is to copy a and b to other registers, then shift the xmm1 register by 4 bytes and move a or b into the lowest 4 bytes. Answer 1: You're looking for unpcklps xmm1, xmm1 (https://www.felixcloutier.com/x86/unpcklps) to interleave the low elements of a register with itself: low element ->

What does AT&T syntax do about ambiguity between other mnemonics and operand-size suffixes?

Submitted by 谁说我不能喝 on 2021-02-05 07:12:05
Question: In AT&T syntax, instructions often have to be suffixed with the appropriate operand size, with q for operations on 64-bit operands. However, MMX and SSE also have a movq instruction, where the q is part of the original Intel mnemonic and not an additional suffix. So how is this represented in AT&T? Is another q suffix needed, like movqq %mm1, %mm0 movqq %xmm1, %xmm0 or not? And if there are any other instructions that end like AT&T suffixes (like paddd , slld ), do they work the same way?