sse | 易学教程

Shift a __m128i of n bits

Question: I have a __m128i variable and I need to shift its 128-bit value by n bits, i.e. the way _mm_srli_si128 and _mm_slli_si128 work, but on bits instead of bytes. What is the most efficient way of doing this?

Answer 1: This is the best that I could come up with for left/right immediate shifts with SSE2:

    #include <stdio.h>
    #include <emmintrin.h>

    #define SHL128(v, n) \
    ({ \
        __m128i v1, v2; \
     \
        if ((n) >= 64) \
        { \
            v1 = _mm_slli_si128(v, 8); \
            v1 = _mm_slli_epi64(v1, (n) - 64); \
        } \
        else \
        { \
            v1 = _mm_slli
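The excerpt above is cut off, so the following is only a minimal sketch of the same idea, not the answer's macro. It assumes a run-time shift count rather than an immediate, and therefore uses _mm_sll_epi64 / _mm_srl_epi64, which take the count in a vector register. The function name shl128 is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Shift the whole 128-bit value in v left by n bits, 0 <= n <= 127.
   shl128 is an illustrative name, not taken from the quoted answer. */
static __m128i shl128(__m128i v, int n)
{
    assert(n >= 0 && n < 128);

    /* Low qword moved into the high qword; the low qword becomes zero. */
    __m128i low_in_high = _mm_slli_si128(v, 8);

    if (n >= 64) {
        /* Only bits from the low qword survive; shift them by the remainder. */
        return _mm_sll_epi64(low_in_high, _mm_cvtsi32_si128(n - 64));
    }

    /* Shift both qwords independently ... */
    __m128i shifted = _mm_sll_epi64(v, _mm_cvtsi32_si128(n));
    /* ... then recover the n bits that crossed from the low qword into the
       high qword. For n == 0 the count is 64, which yields all zeros. */
    __m128i carry = _mm_srl_epi64(low_in_high, _mm_cvtsi32_si128(64 - n));
    return _mm_or_si128(shifted, carry);
}
```

A right shift would mirror this with _mm_srli_si128 and the shift directions swapped.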

Performance worsens when using SSE (Simple addition of integer arrays)

Question: I'm trying to use SSE intrinsics to add two 32-bit signed int arrays, but I'm getting very poor performance compared to a linear addition. Platform: Intel Core i3 550, GCC 4.4.3, Ubuntu 10.04 (a bit old, yeah).

    #define ITER 1000

    typedef union sint4_u {
        __m128i v;
        sint32_t x[4];
    } sint4;

The functions:

    void compute(sint32_t *a, sint32_t *b, sint32_t *c)
    {
        sint32_t len = 96000;
        sint32_t i, j;
        __m128i x __attribute__ ((aligned(16)));
        __m128i y __attribute__ ((aligned(16)));
        sint4 z;
        for(j = 0; j <
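The question's loop is truncated, so the cause of the slowdown cannot be read from this excerpt. For comparison, here is a minimal sketch of a straightforward SSE2 array add that loads, adds, and stores whole vectors without going through a union. The name add_i32 and the use of unaligned loads are assumptions made for this sketch, and int32_t from <stdint.h> stands in for the question's sint32_t.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* c[i] = a[i] + b[i] for i in [0, len). add_i32 is an illustrative name. */
void add_i32(const int32_t *a, const int32_t *b, int32_t *c, size_t len)
{
    size_t i = 0;

    /* Four 32-bit lanes per iteration; loadu/storeu accept any alignment. */
    for (; i + 4 <= len; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i y = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(x, y));
    }

    /* Scalar tail for the last len % 4 elements. The add is done in
       unsigned arithmetic so it wraps the same way the vector add does. */
    for (; i < len; i++) {
        c[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
    }
}
```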

Does compiler use SSE instructions for a regular C code?

Question: I see people using the -msse -msse2 -mfpmath=sse flags by default, hoping that this will improve performance. I know that SSE gets engaged when special vector types are used in the C code. But do these flags make any difference for regular C code? Does the compiler use SSE to optimize regular C code?

Answer 1: Yes, modern compilers auto-vectorize with SSE2 if you compile with full optimization. clang enables it even at -O2, gcc at -O3. Even at -O1 or -Os, compilers will use SIMD load/store instructions to
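As a concrete instance of "regular C code", a loop like the one below is the kind that compilers commonly auto-vectorize at the optimization levels the answer names. This is a sketch only: the function name scale_add is hypothetical, and whether vector instructions are actually emitted depends on the compiler version, target, and flags, so it is worth inspecting the generated assembly (for example with gcc -O3 -S).

```c
#include <stddef.h>

/* y[i] += k * x[i]. The restrict qualifiers promise the compiler that x and
   y do not overlap, which removes one common obstacle to vectorization.
   scale_add is an illustrative name. */
void scale_add(float *restrict y, const float *restrict x, float k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] += k * x[i];
    }
}
```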

How do I vectorize data_i16[0 to 15]?

Question: I'm on the Intel Intrinsics site and I can't figure out what combination of instructions I want. What I'd like to do is result = high_table[i8>>4] & low_table[i8&15], where both tables are 16-bit (or more). Shuffle seems like what I want (_mm_shuffle_epi8); however, getting an 8-bit value doesn't work for me. There doesn't seem to be a 16-bit version, and the non-byte version seems to need the second parameter as an immediate value. How am I supposed to implement this? Do I call _mm_shuffle_epi8 twice for

How to move a floating-point constant value into an xmm register?

Question: Is the only way to move a value into an xmm register to first move the value into an integer register (I don't know what they are called) and then into the xmm register? E.g.

    mov [eax], (float)1000    ; store to memory
    movss xmm1,[eax]          ; reload

or

    mov eax, 1000             ; move-immediate integer
    cvtsi2ss xmm1,eax         ; and convert

Or is there another way? Is there a way to directly move a value into an xmm register, something along the lines of movss xmm1,(float)1000?

Answer 1: There are no instructions to load an SSE
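The question is about assembly, but the same two routes can be written from C with intrinsics. This is a hedged sketch: the compiler chooses the actual instructions, and for the first function it will typically load the constant from memory. The function names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Route 1: ask for the float constant directly; the compiler decides how
   to materialize it. constant_as_float is an illustrative name. */
static __m128 constant_as_float(void)
{
    return _mm_set_ss(1000.0f);
}

/* Route 2: start from an integer and convert it, which corresponds to the
   cvtsi2ss variant in the question. constant_from_int is an illustrative
   name. */
static __m128 constant_from_int(void)
{
    return _mm_cvtsi32_ss(_mm_setzero_ps(), 1000);
}
```

Both functions leave 1000.0f in the lowest lane, which _mm_cvtss_f32 can read back.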

SSE2 integer overflow checking

Question: When using SSE2 instructions such as PADDD (i.e., the _mm_add_epi32 intrinsic), is there a way to check whether any of the operations overflowed? I thought that maybe a flag in the MXCSR control register might get set after an overflow, but I don't see that happening. For example, _mm_getcsr() prints the same value in both cases below (8064):

    #include <iostream>
    #include <emmintrin.h>
    using namespace std;

    void main()
    {
        __m128i a = _mm_set_epi32(1, 0, 0, 0);
        __m128i b = _mm_add_epi32(a, a);
        cout
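MXCSR reports floating-point exception state only, and integer SIMD adds wrap silently, so overflow has to be detected in software. One common bit trick, sketched here under the assumption that signed 32-bit overflow is what matters, is that a + b overflows exactly when a and b have the same sign and the sum has the opposite sign. The function name is hypothetical, and this is not necessarily what the answers to the question propose.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Adds a and b lane-wise, stores the wrapped result in *sum, and returns a
   4-bit mask in which bit i is set if lane i overflowed as a signed 32-bit
   add. add_epi32_overflow_mask is an illustrative name. */
static int add_epi32_overflow_mask(__m128i a, __m128i b, __m128i *sum)
{
    __m128i s = _mm_add_epi32(a, b);

    /* (a ^ s) & (b ^ s) has its sign bit set only when the sign of s
       differs from the signs of both a and b. */
    __m128i ov = _mm_and_si128(_mm_xor_si128(a, s), _mm_xor_si128(b, s));

    *sum = s;

    /* Collect the sign bit of each 32-bit lane into bits 0..3. */
    return _mm_movemask_ps(_mm_castsi128_ps(ov));
}
```

A nonzero return value means at least one lane overflowed. Lane 0 is the last argument of _mm_set_epi32.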

How do I efficiently lookup 16bits in a 128bit SIMD vector? [duplicate]

Question: This question already has answers here: SSE/SIMD shift with one-byte element size / granularity? (2 answers); How do I vectorize data_i16[0 to 15]? (1 answer). Closed 3 days ago.

I'm trying to implement the strategy described in an answer to How do I vectorize data_i16[0 to 15]? Code below. The spot I'd like to fix is the for(int i=0; i<ALIGN; i++) loop. I'm new to SIMD. From what I can tell, I'd load the high/low nibble table by writing const auto HI_TBL = _mm_load_si128((__m128i*)HighNibble)

AVX 256-bit vectors slightly slower than scalar (~10%) for STREAM-like double add loop on huge arrays, on Xeon Gold

Question: I am new to the AVX512 instruction set and I wrote the following code as a demo.

    #include <iostream>
    #include <array>
    #include <chrono>
    #include <vector>
    #include <cstring>
    #include <omp.h>
    #include <immintrin.h>
    #include <cstdlib>

    int main() {
        unsigned long m, n, k;
        m = n = k = 1 << 30;
        auto *a = static_cast<double*>(aligned_alloc(512, m*sizeof(double)));
        auto *b = static_cast<double*>(aligned_alloc(512, n*sizeof(double)));
        auto *c = static_cast<double*>(aligned_alloc(512, k*sizeof(double)));