sse

Shift a __m128i of n bits

巧了我就是萌 posted on 2020-06-24 22:10:50
Question: I have a __m128i variable and I need to shift its 128-bit value by n bits, i.e. the way _mm_srli_si128 and _mm_slli_si128 work, but on bits instead of bytes. What is the most efficient way of doing this?

Answer 1: This is the best that I could come up with for left/right immediate shifts with SSE2:

    #include <stdio.h>
    #include <emmintrin.h>

    #define SHL128(v, n) \
    ({ \
        __m128i v1, v2; \
     \
        if ((n) >= 64) \
        { \
            v1 = _mm_slli_si128(v, 8); \
            v1 = _mm_slli_epi64(v1, (n) - 64); \
        } \
        else \
        { \
            v1 = _mm_slli
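
The quoted macro is cut off, so here is a minimal sketch of the same idea written as a plain function with a run-time shift count instead of an immediate. The names shl128, lo64 and hi64 are made up for illustration and are not from the original answer. It assumes 0 <= n <= 127 and relies on _mm_sll_epi64 / _mm_srl_epi64 producing zero for counts above 63.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: shift the whole 128-bit value left by n bits (0..127). */
static __m128i shl128(__m128i v, unsigned n)
{
    if (n >= 64) {
        /* Move the low qword into the high qword, then shift by the remainder. */
        __m128i moved = _mm_slli_si128(v, 8);
        return _mm_sll_epi64(moved, _mm_cvtsi32_si128((int)(n - 64)));
    }
    /* Shift both 64-bit lanes independently... */
    __m128i shifted = _mm_sll_epi64(v, _mm_cvtsi32_si128((int)n));
    /* ...then recover the bits that cross from the low lane into the high lane.
       For n == 0 the count is 64, which yields zero, so nothing is OR-ed in. */
    __m128i carry = _mm_srl_epi64(_mm_slli_si128(v, 8),
                                  _mm_cvtsi32_si128((int)(64 - n)));
    return _mm_or_si128(shifted, carry);
}

/* Hypothetical helpers to read back the two 64-bit halves. */
static uint64_t lo64(__m128i v)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[0];
}

static uint64_t hi64(__m128i v)
{
    uint64_t out[2];
    _mm_storeu_si128((__m128i *)out, v);
    return out[1];
}
```

A right shift would mirror this with _mm_srli_si128 and the shift directions swapped. Whether this beats the immediate-count macro depends on the compiler and on whether n is a compile-time constant.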

Performance worsens when using SSE (Simple addition of integer arrays)

删除回忆录丶 posted on 2020-06-23 16:45:08
Question: I'm trying to use SSE intrinsics to add two arrays of 32-bit signed ints, but I'm getting very poor performance compared to a plain scalar loop. Platform: Intel Core i3 550, GCC 4.4.3, Ubuntu 10.04 (a bit old, yeah).

    #define ITER 1000

    typedef union sint4_u {
        __m128i v;
        sint32_t x[4];
    } sint4;

The functions:

    void compute(sint32_t *a, sint32_t *b, sint32_t *c)
    {
        sint32_t len = 96000;
        sint32_t i, j;
        __m128i x __attribute__ ((aligned(16)));
        __m128i y __attribute__ ((aligned(16)));
        sint4 z;
        for(j = 0; j <
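
For comparison, a minimal sketch of adding two int32_t arrays that keeps everything in vector registers: load four elements, add, store four elements, with a scalar loop for the leftover tail. The function name add_i32 is an assumption, and unaligned loads/stores are used so no alignment is required of the caller. The quoted question is truncated, so this is not a diagnosis of it, although copying results out lane by lane through a union like sint4 is one plausible source of overhead.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: c[i] = a[i] + b[i] for i in [0, len). */
static void add_i32(const int32_t *a, const int32_t *b, int32_t *c, size_t len)
{
    size_t i = 0;

    /* Main loop: four 32-bit lanes per iteration. */
    for (; i + 4 <= len; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i y = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(x, y));
    }

    /* Scalar tail for lengths that are not a multiple of four. */
    for (; i < len; ++i)
        c[i] = a[i] + b[i];
}
```

With optimization enabled, a compiler may well generate very similar code from the plain scalar loop on its own, so the intrinsic version is mainly useful as a baseline to measure against.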

Does compiler use SSE instructions for a regular C code?

北慕城南 posted on 2020-05-24 20:34:07
Question: I see people using the -msse -msse2 -mfpmath=sse flags by default, hoping that this will improve performance. I know that SSE gets engaged when special vector types are used in C code. But do these flags make any difference for regular C code? Does the compiler use SSE to optimize regular C code?

Answer 1: Yes, modern compilers auto-vectorize with SSE2 if you compile with full optimization. clang enables it even at -O2, gcc at -O3. Even at -O1 or -Os, compilers will use SIMD load/store instructions to
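
As an illustration of "regular C code" that a compiler can auto-vectorize, here is a small sketch with no intrinsics at all. The function name scale_add is made up. The restrict qualifiers are an assumption added to tell the compiler the arrays do not overlap, which tends to make vectorization easier.

```c
#include <stddef.h>

/* Hypothetical example: plain C loop, no vector types or intrinsics.
   With optimization enabled, the compiler is free to emit SIMD code for it. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] * k + b[i];
}
```

To see what actually happened, one option is to inspect the generated assembly (for example with gcc -O3 -S) and look for packed instructions such as mulps/addps. GCC's -fopt-info-vec and clang's -Rpass=loop-vectorize can also report which loops were vectorized.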

How do I vectorize data_i16[0 to 15]?

南笙酒味 posted on 2020-05-23 21:07:48
Question: I'm on the Intel intrinsics site and I can't figure out which combination of instructions I want. What I'd like to do is

    result = high_table[i8>>4] & low_table[i8&15]

where both tables hold 16-bit (or wider) entries. A shuffle seems like what I want (_mm_shuffle_epi8), but getting an 8-bit value back doesn't work for me. There doesn't seem to be a 16-bit version, and the non-byte version seems to need the second parameter as an immediate value. How am I supposed to implement this? Do I call _mm_shuffle_epi8 twice for
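
One way to get 16-bit results out of a byte shuffle, sketched under assumptions: split each 16-bit table into a low-byte table and a high-byte table, run _mm_shuffle_epi8 once per byte table, AND the halves, and interleave them back into 16-bit values. The function name lookup16 is made up, the per-call table splitting is only there to keep the sketch self-contained (in real code it would be done once), and the target attribute is a GCC/clang extension used so the example builds without -mssse3. This is not taken from the truncated question or its answers.

```c
#include <emmintrin.h>  /* SSE2 */
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */
#include <stdint.h>

/* Hypothetical helper: out[i] = high_table[in[i] >> 4] & low_table[in[i] & 15]
   for 16 input bytes, producing 16 results of 16 bits each. */
__attribute__((target("ssse3")))
static void lookup16(const uint8_t in[16],
                     const uint16_t high_table[16],
                     const uint16_t low_table[16],
                     uint16_t out[16])
{
    /* Split each 16-bit table into two 16-entry byte tables. */
    uint8_t h_lo[16], h_hi[16], l_lo[16], l_hi[16];
    for (int i = 0; i < 16; ++i) {
        h_lo[i] = (uint8_t)(high_table[i] & 0xFF);
        h_hi[i] = (uint8_t)(high_table[i] >> 8);
        l_lo[i] = (uint8_t)(low_table[i] & 0xFF);
        l_hi[i] = (uint8_t)(low_table[i] >> 8);
    }
    const __m128i th_lo = _mm_loadu_si128((const __m128i *)h_lo);
    const __m128i th_hi = _mm_loadu_si128((const __m128i *)h_hi);
    const __m128i tl_lo = _mm_loadu_si128((const __m128i *)l_lo);
    const __m128i tl_hi = _mm_loadu_si128((const __m128i *)l_hi);

    /* Extract the nibble indices. There is no 8-bit shift, so shift 16-bit
       lanes and mask away the bits that leak in from the neighbouring byte. */
    const __m128i v = _mm_loadu_si128((const __m128i *)in);
    const __m128i nib = _mm_set1_epi8(0x0F);
    const __m128i lo_idx = _mm_and_si128(v, nib);
    const __m128i hi_idx = _mm_and_si128(_mm_srli_epi16(v, 4), nib);

    /* One shuffle per byte table, then AND the matching halves. */
    const __m128i res_lo = _mm_and_si128(_mm_shuffle_epi8(th_lo, hi_idx),
                                         _mm_shuffle_epi8(tl_lo, lo_idx));
    const __m128i res_hi = _mm_and_si128(_mm_shuffle_epi8(th_hi, hi_idx),
                                         _mm_shuffle_epi8(tl_hi, lo_idx));

    /* Interleave low and high result bytes into little-endian 16-bit values. */
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi8(res_lo, res_hi));
    _mm_storeu_si128((__m128i *)(out + 8), _mm_unpackhi_epi8(res_lo, res_hi));
}
```

That is four shuffles per 16 input bytes. If the tables are fixed, the four byte tables can be precomputed and kept in registers across calls.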

How to move a floating-point constant value into an xmm register?

烂漫一生 posted on 2020-05-15 08:33:22
Question: Is the only way to get a value into an xmm register to first move it into an integer register (I don't know what they are called) and then into the xmm register? For example:

    mov [eax], (float)1000 ; store to memory
    movss xmm1,[eax] ; reload

or

    mov eax, 1000 ; move-immediate integer
    cvtsi2ss xmm1,eax ; and convert

Or is there another way? Is there a way to move a value directly into an xmm register, something along the lines of: movss xmm1,(float)1000 ?

Answer 1: There are no instructions to load an SSE
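
Seen from C, the usual routes look roughly like the sketch below. The function names are made up, and which instructions the compiler actually emits for each is up to the compiler; the comments describe the typical outcome rather than a guarantee. The bit pattern 0x447A0000 is the IEEE-754 single-precision encoding of 1000.0f.

```c
#include <emmintrin.h>  /* SSE2 (also pulls in the SSE header xmmintrin.h) */

/* Route 1: let the compiler materialize the constant, typically as a
   movss load from a read-only data section. */
static float const_from_memory(void)
{
    __m128 v = _mm_set_ss(1000.0f);
    return _mm_cvtss_f32(v);
}

/* Route 2: integer immediate in a general-purpose register, then convert
   (the cvtsi2ss approach from the question). */
static float const_from_int_convert(void)
{
    __m128 v = _mm_cvtsi32_ss(_mm_setzero_ps(), 1000);
    return _mm_cvtss_f32(v);
}

/* Route 3: move the raw bit pattern of the float through an integer
   register (movd), with no conversion involved. */
static float const_from_bit_pattern(void)
{
    __m128 v = _mm_castsi128_ps(_mm_cvtsi32_si128(0x447A0000));
    return _mm_cvtss_f32(v);
}
```

All three return 1000.0f. None of them uses a floating-point immediate, because movss has no immediate form.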

SSE2 integer overflow checking

扶醉桌前 posted on 2020-05-10 03:51:30
Question: When using SSE2 instructions such as PADDD (i.e., the _mm_add_epi32 intrinsic), is there a way to check whether any of the operations overflowed? I thought that a flag in the MXCSR control register might get set after an overflow, but I don't see that happening. For example, _mm_getcsr() prints the same value in both cases below (8064):

    #include <iostream>
    #include <emmintrin.h>
    using namespace std;
    void main() {
        __m128i a = _mm_set_epi32(1, 0, 0, 0);
        __m128i b = _mm_add_epi32(a, a);
        cout
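
MXCSR only tracks floating-point exceptions, so integer instructions such as PADDD leave it untouched and any overflow check has to be computed explicitly. A minimal sketch for signed 32-bit lanes, using the standard sign-bit test (a lane overflowed if the sum's sign differs from the sign of both inputs); the function name add_epi32_overflow_mask is an assumption:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: wrapping 32-bit add that also reports signed overflow.
   Returns a 4-bit mask; bit i is set if lane i overflowed. */
static int add_epi32_overflow_mask(__m128i a, __m128i b, __m128i *sum)
{
    __m128i s = _mm_add_epi32(a, b);

    /* The sign bit of (a ^ s) & (b ^ s) is set exactly when a and b have the
       same sign and the sum has the opposite one. */
    __m128i ovf = _mm_and_si128(_mm_xor_si128(a, s), _mm_xor_si128(b, s));

    *sum = s;

    /* movmskps collects the sign bit of each 32-bit lane into an int. */
    return _mm_movemask_ps(_mm_castsi128_ps(ovf));
}
```

A non-zero return value means at least one lane wrapped. For unsigned lanes the test would be different (the sum ends up smaller than an input), which this sketch does not cover.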

How do I efficiently lookup 16bits in a 128bit SIMD vector? [duplicate]

耗尽温柔 posted on 2020-04-30 06:29:30
Question: This question already has answers here: SSE/SIMD shift with one-byte element size / granularity? (2 answers); How do I vectorize data_i16[0 to 15]? (1 answer). Closed 3 days ago.

I'm trying to implement the strategy described in an answer to "How do I vectorize data_i16[0 to 15]?". Code below. The spot I'd like to fix is the for(int i=0; i<ALIGN; i++) loop. I'm new to SIMD. From what I can tell, I'd load the high/low nibble table by writing

    const auto HI_TBL = _mm_load_si128((__m128i*)HighNibble)

AVX 256-bit vectors slightly slower than scalar (~10%) for STREAM-like double add loop on huge arrays, on Xeon Gold

回眸只為那壹抹淺笑 posted on 2020-04-11 04:56:06
Question: I am new to the AVX512 instruction set and wrote the following code as a demo.

    #include <iostream>
    #include <array>
    #include <chrono>
    #include <vector>
    #include <cstring>
    #include <omp.h>
    #include <immintrin.h>
    #include <cstdlib>

    int main() {
        unsigned long m, n, k;
        m = n = k = 1 << 30;
        auto *a = static_cast<double*>(aligned_alloc(512, m*sizeof(double)));
        auto *b = static_cast<double*>(aligned_alloc(512, n*sizeof(double)));
        auto *c = static_cast<double*>(aligned_alloc(512, k*sizeof(double)));
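
The quoted program stops before its compute loop, so the following is only a sketch of what a STREAM-like c[i] = a[i] + b[i] kernel with 256-bit AVX vectors could look like, written in C rather than the question's C++. The names add_avx, add_scalar and add_f64 are made up. The target attribute and __builtin_cpu_supports are GCC/clang extensions, used here so the code builds without -mavx and falls back to scalar code on CPUs without AVX.

```c
#include <immintrin.h>  /* AVX intrinsics */
#include <stddef.h>

/* Hypothetical AVX kernel: four doubles per iteration, unaligned accesses. */
__attribute__((target("avx")))
static void add_avx(const double *a, const double *b, double *c, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d x = _mm256_loadu_pd(a + i);
        __m256d y = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(c + i, _mm256_add_pd(x, y));
    }
    for (; i < n; ++i)  /* scalar tail */
        c[i] = a[i] + b[i];
}

/* Scalar reference version of the same kernel. */
static void add_scalar(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Hypothetical dispatcher: use AVX only when the CPU reports support. */
void add_f64(const double *a, const double *b, double *c, size_t n)
{
    if (__builtin_cpu_supports("avx"))
        add_avx(a, b, c, n);
    else
        add_scalar(a, b, c, n);
}
```

A kernel like this does one add per 24 bytes of memory traffic, so on arrays far larger than the caches it is typically limited by memory bandwidth rather than by vector width. That is a general expectation, not a measurement of the code in the question.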