sse

Is it possible to vectorize this nested for with SSE?

徘徊边缘 submitted on 2019-12-24 03:01:07
Question: I've never written assembly code for SSE optimization, so sorry if this is a noob question. This article explains how to vectorize a for loop with a conditional statement. However, my code (taken from here) is of the form:

    for (int j = -halfHeight; j <= halfHeight; ++j) {
        for (int i = -halfWidth; i <= halfWidth; ++i) {
            const float rx = ofsx + j * a12;
            const float ry = ofsy + j * a22;
            float wx = rx + i * a11;
            float wy = ry + i * a21;
            const int x = (int) floor(wx);
            const int y = (int) floor(wy);
            if
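Under the assumption that the hot part is the coordinate computation and the floor() calls, here is a minimal SSE2-only sketch (the helper names wx_floor4 and floor_ps_epi32 are hypothetical) that processes four consecutive i values per step; floor is emulated by truncate-plus-correction and is valid while wx fits in the int32 range:

```cpp
#include <emmintrin.h>  // SSE2

// floor(x) for four floats at once, SSE2 only (no SSE4.1 _mm_floor_ps):
// truncate toward zero, then subtract 1 in lanes where truncation rounded
// up (i.e. negative non-integer inputs).
static inline __m128i floor_ps_epi32(__m128 x) {
    __m128i t  = _mm_cvttps_epi32(x);                    // truncate toward 0
    __m128  ft = _mm_cvtepi32_ps(t);
    __m128i up = _mm_castps_si128(_mm_cmpgt_ps(ft, x));  // -1 where trunc > x
    return _mm_add_epi32(t, up);                         // t - 1 in those lanes
}

// One vector step of:  wx = rx + i * a11;  x = (int)floor(wx);
void wx_floor4(float rx, float a11, int i, int out[4]) {
    __m128 iv = _mm_setr_ps((float)i, (float)(i + 1), (float)(i + 2), (float)(i + 3));
    __m128 wx = _mm_add_ps(_mm_set1_ps(rx), _mm_mul_ps(iv, _mm_set1_ps(a11)));
    _mm_storeu_si128((__m128i*)out, floor_ps_epi32(wx));
}
```

The wy/y computation is the same with ry and a21; the conditional body would then be handled with compare masks as in the linked article.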

how to optimise double dereferencing?

爷，独闯天下 submitted on 2019-12-24 00:34:02
Question: A very specific optimisation task. I have 3 arrays: const char* inputTape, const int* inputOffset (organised in groups of four), and char* outputTape. I must assemble the output tape from the input according to the following 5 operations:

    int selectorOffset = inputOffset[4*i];
    char selectorValue = inputTape[selectorOffset];
    int outputOffset = inputOffset[4*i+1+selectorValue];
    char outputValue = inputTape[outputOffset];
    outputTape[i] = outputValue; // store byte and then advance counter.

All
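Rewritten as a self-contained function (assemble is a hypothetical name), the five operations per output byte look like this; the key obstacle is that the second load's address depends on the byte fetched by the first load, a serial dependency that plain SIMD gathers cannot hide:

```cpp
// One output byte needs two dependent loads: inputTape[selectorOffset]
// yields selectorValue, which then selects which of the following offsets
// in the group of four to dereference.
void assemble(const char* inputTape, const int* inputOffset,
              char* outputTape, int count) {
    for (int i = 0; i < count; ++i) {
        int  selectorOffset = inputOffset[4 * i];
        char selectorValue  = inputTape[selectorOffset];              // load #1
        int  outputOffset   = inputOffset[4 * i + 1 + selectorValue];
        outputTape[i]       = inputTape[outputOffset];                // load #2
    }
}
```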

SSE ints vs. floats practice

时光毁灭记忆、已成空白 submitted on 2019-12-24 00:33:47
Question: When dealing with both ints and floats in SSE (AVX), is it good practice to convert all ints to floats and work only with floats? We need only a few SIMD instructions after that, and all we need are addition and compare instructions ( <, <=, == ), which, I hope, this conversion retains completely.

Answer 1: Expanding my comments into an answer. Basically you are weighing the following trade-off: Stick with integer: Integer SSE is low-latency, high-throughput. (dual issue on Sandy
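On the "retains completely" point: int32 values with magnitude at most 2^24 convert to float exactly, so <, <=, == give the same answers after conversion. A minimal SSE2 sketch of the compare side (less_mask is a hypothetical name):

```cpp
#include <emmintrin.h>  // SSE2

// Compare four int32 lanes as floats; returns a 4-bit mask with bit i set
// where a[i] < b[i]. Exact as long as the values stay within float's exact
// integer range (|x| <= 2^24), which is what makes the int->float strategy safe.
int less_mask(__m128i a, __m128i b) {
    __m128 fa = _mm_cvtepi32_ps(a);
    __m128 fb = _mm_cvtepi32_ps(b);
    return _mm_movemask_ps(_mm_cmplt_ps(fa, fb));
}
```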

Horizontal xor of two SSE values

邮差的信 submitted on 2019-12-23 22:19:37
Question: I need to do a horizontal xor of two 128-bit integers (by 32-bit integers) and combine the results into one 64-bit integer. So, an operation like this:

    uint32_t x0[4];
    uint32_t x1[4];
    uint32_t xor0 = x0[0];
    uint32_t xor1 = x1[0];
    for (int i = 1; i < 4; ++i) {
        xor0 ^= x0[i];
        xor1 ^= x1[i];
    }
    uint64_t result = uint64_t(xor1) << 32 | xor0;

I finally found the following code, which seems to work:

    __m128i x0 = ...;
    __m128i x1 = ...;
    __m128i xor64_0 = _mm_unpackhi_epi64(x0, x1);
    __m128i xor64_1 = _mm_unpacklo
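Completing the unpack idea into a full routine (hxor2 is a hypothetical name; SSE2 only): unpacking pairs the two inputs' 64-bit halves so one xor reduces both registers at once, and in-register shuffles finish each horizontal xor:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Returns (xor of x1's four 32-bit lanes) << 32 | (xor of x0's four lanes),
// matching the scalar reference loop above.
uint64_t hxor2(__m128i x0, __m128i x1) {
    __m128i hi = _mm_unpackhi_epi64(x0, x1); // {x0[2],x0[3], x1[2],x1[3]}
    __m128i lo = _mm_unpacklo_epi64(x0, x1); // {x0[0],x0[1], x1[0],x1[1]}
    __m128i t  = _mm_xor_si128(hi, lo);      // pairwise xors of both inputs
    __m128i s  = _mm_shuffle_epi32(t, _MM_SHUFFLE(2, 3, 0, 1)); // swap in pairs
    __m128i u  = _mm_xor_si128(t, s);        // lane0 = xor0, lane2 = xor1
    __m128i r  = _mm_shuffle_epi32(u, _MM_SHUFFLE(3, 1, 2, 0)); // pack lanes 0,2
    return (uint64_t)_mm_cvtsi128_si64(r);   // low 64 bits = xor1<<32 | xor0
}
```

_mm_cvtsi128_si64 requires an x86-64 target; on 32-bit x86 two _mm_cvtsi128_si32 extractions would be needed instead.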

best way to shuffle across AVX lanes?

ε祈祈猫儿з submitted on 2019-12-23 21:22:02
Question: There are questions with similar titles, but my question relates to one very specific use case not covered elsewhere. I have 4 __m128d registers (x0, x1, x2, x3) and I want to recombine their content into 5 __m256d registers (y0, y1, y2, y3, y4) as follows, in preparation for other calculations:

    on entry:
        x0 contains {a0, a1}
        x1 contains {a2, a3}
        x2 contains {a4, a5}
        x3 contains {a6, a7}
    on exit:
        y0 contains {a0, a1, a2, a3}
        y1 contains {a1, a2, a3, a4}
        y2 contains {a2, a3, a4, a5}
        y3 contains {a3
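Assuming y3 = {a3, a4, a5, a6} and y4 = {a4, a5, a6, a7} complete the pattern, one way is a sketch like the following (make_windows and its array interface are hypothetical; GCC/Clang's target attribute is used so it compiles without -mavx): build the even windows y0, y2, y4 with vinsertf128, then derive the odd windows with in-lane vshufpd, so no cross-lane shuffle is needed at all:

```cpp
#include <immintrin.h>

__attribute__((target("avx")))
void make_windows(const double in[8], double out[5][4]) {
    __m128d x0 = _mm_loadu_pd(in + 0); // {a0,a1}
    __m128d x1 = _mm_loadu_pd(in + 2); // {a2,a3}
    __m128d x2 = _mm_loadu_pd(in + 4); // {a4,a5}
    __m128d x3 = _mm_loadu_pd(in + 6); // {a6,a7}
    __m256d y0 = _mm256_insertf128_pd(_mm256_castpd128_pd256(x0), x1, 1); // {a0..a3}
    __m256d y2 = _mm256_insertf128_pd(_mm256_castpd128_pd256(x1), x2, 1); // {a2..a5}
    __m256d y4 = _mm256_insertf128_pd(_mm256_castpd128_pd256(x2), x3, 1); // {a4..a7}
    // vshufpd picks odd/even elements per 128-bit lane from two sources:
    __m256d y1 = _mm256_shuffle_pd(y0, y2, 0b0101); // {a1,a2,a3,a4}
    __m256d y3 = _mm256_shuffle_pd(y2, y4, 0b0101); // {a3,a4,a5,a6}
    _mm256_storeu_pd(out[0], y0); _mm256_storeu_pd(out[1], y1);
    _mm256_storeu_pd(out[2], y2); _mm256_storeu_pd(out[3], y3);
    _mm256_storeu_pd(out[4], y4);
}
```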

Fast vectorized conversion from RGB to BGRA

我的未来我决定 submitted on 2019-12-23 20:14:22
Question: In a follow-up to some previous questions on converting RGB to RGBA, and ARGB to BGR, I would like to speed up an RGB to BGRA conversion with SSE. Assume a 32-bit machine; I would like to use intrinsics. I'm having difficulty aligning both source and destination buffers to work with 128-bit registers, and am seeking other savvy vectorization solutions. The routine to be vectorized is as follows...

    void RGB8ToBGRX8(int w, const void *in, void *out) {
        int i;
        int width = w;
        const unsigned char
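One common approach (a sketch, not the accepted answer; rgb_to_bgrx and the 0xFF alpha fill are assumptions) uses SSSE3 pshufb with unaligned loads and stores, sidestepping the alignment problem entirely; each 16-byte load covers four 3-byte pixels:

```cpp
#include <tmmintrin.h>  // SSSE3
#include <cstdint>

__attribute__((target("ssse3")))
void rgb_to_bgrx(const uint8_t* in, uint8_t* out, int pixels) {
    // For 4 pixels, pick source bytes {2,1,0}, {5,4,3}, ... (RGB -> BGR) and
    // route 0x80 (pshufb's "emit zero" flag) into the X slot, filled via OR.
    const __m128i mask = _mm_setr_epi8(2, 1, 0, (char)0x80,  5,  4, 3, (char)0x80,
                                       8, 7, 6, (char)0x80, 11, 10, 9, (char)0x80);
    const __m128i alpha = _mm_set1_epi32((int)0xFF000000);
    int i = 0;
    for (; i + 5 < pixels; i += 4) {   // 16-byte load = 12 used + over-read guard
        __m128i rgb  = _mm_loadu_si128((const __m128i*)(in + 3 * i));
        __m128i bgrx = _mm_or_si128(_mm_shuffle_epi8(rgb, mask), alpha);
        _mm_storeu_si128((__m128i*)(out + 4 * i), bgrx);
    }
    for (; i < pixels; ++i) {          // scalar tail; also avoids reading past in[]
        out[4*i + 0] = in[3*i + 2];
        out[4*i + 1] = in[3*i + 1];
        out[4*i + 2] = in[3*i + 0];
        out[4*i + 3] = 0xFF;
    }
}
```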

Get an arbitrary float from a simd register at runtime?

落花浮王杯 submitted on 2019-12-23 20:10:44
Question: I want to access an arbitrary float from a SIMD register. I know that I can do things like:

    float get(const __m128& a, const int idx){
        // editor's note: this type-puns the FP bit-pattern to int and converts to float
        return _mm_extract_ps(a, idx);
    }

or

    float get(const __m128& a, const int idx){
        return _mm_cvtss_f32(_mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 0, 0, idx)));
    }

or even using a shift instead of a shuffle. The problem is that these all require idx to be known at compile time (shuffle, shift, and
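For a truly runtime-variable idx, the usual portable answer is to spill the register to memory and index it; a minimal sketch (get_lane is a hypothetical name):

```cpp
#include <xmmintrin.h>  // SSE

// Runtime-variable lane extraction: store the register to a small aligned
// array and index it; compilers lower this to a store plus a scalar load.
float get_lane(__m128 v, int idx) {
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, v);
    return tmp[idx & 3];  // mask keeps the access in bounds
}
```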

Loop is not vectorized when variable extent is used

拥有回忆 submitted on 2019-12-23 17:13:38
Question: Version A of the code is not vectorized while version B is. How can I make version A vectorize while keeping the variable extents (without using literal extents)? The nested loop is a multiplication with broadcasting, as in the numpy library of Python and in MATLAB. A description of broadcasting in the numpy library is here. Version A code (no std::vector; no vectorization). This only uses imull (%rsi), %edx in .L169, which is not a SIMD instruction. gcc godbolt

    #include <iostream>
    #include <stdint.h>
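A common remedy (an assumption about the cause, not a quote from the answers): if the extents or pointers can be reloaded or can alias each other, the compiler cannot prove fixed trip counts or non-overlapping stores. Copying the extents into const locals and marking the pointers __restrict (a GCC/Clang/MSVC extension) often lets the broadcasting multiply vectorize; bcast_mul is a hypothetical name:

```cpp
// out[r][c] = a[r][c] * b[c]: b is broadcast along the rows, as in numpy.
void bcast_mul(float* __restrict out, const float* __restrict a,
               const float* __restrict b, int rows, int cols) {
    const int R = rows, C = cols;  // locals: provably loop-invariant extents
    for (int r = 0; r < R; ++r)
        for (int c = 0; c < C; ++c)
            out[r * C + c] = a[r * C + c] * b[c];
}
```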

Detecting SIMD instruction sets to be used with C++ Macros in Visual Studio 2015

帅比萌擦擦* submitted on 2019-12-23 17:00:41
Question: So, here is what I am trying to accomplish. In my C++ project, which has to be compiled with Microsoft Visual Studio 2015 or above, I need some code to have different versions depending on the newest SIMD instruction set available in the user's CPU, among: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and AVX512. Since what I am looking for at this point is compile-time CPU dispatching, my first guess was that it could be easily accomplished using compiler macros. However,
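A sketch of what the macros can and cannot see (the MSVC side is based on its documented predefined macros, so treat it as an assumption to verify): MSVC only predefines __AVX__, __AVX2__ and __AVX512F__ via /arch, plus _M_IX86_FP on 32-bit x86; there is no /arch level between SSE2 and AVX, so SSE3 through SSE4.2 cannot be distinguished at compile time under MSVC, only under GCC/Clang (__SSE3__, __SSSE3__, __SSE4_1__, __SSE4_2__):

```cpp
// Highest instruction set the *compiler* is targeting (not the user's CPU):
#if defined(__AVX512F__)
  #define SIMD_LEVEL 9   // AVX-512 foundation
#elif defined(__AVX2__)
  #define SIMD_LEVEL 8
#elif defined(__AVX__)
  #define SIMD_LEVEL 7
#elif defined(__SSE4_2__)
  #define SIMD_LEVEL 6   // GCC/Clang only; MSVC never defines this
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #define SIMD_LEVEL 2   // SSE2 is baseline on x64
#elif defined(__SSE__) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #define SIMD_LEVEL 1
#else
  #define SIMD_LEVEL 0
#endif
```

Note this reflects the compile flags, not the machine the binary later runs on; picking a version for the user's actual CPU requires runtime dispatch via cpuid.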

SSE Intrinsics and loop unrolling

荒凉一梦 submitted on 2019-12-23 15:03:10
Question: I am attempting to optimise some loops, and I have managed to, but I wonder if I have only done it partially correctly. Say, for example, that I have this loop:

    for (i = 0; i < n; i++) {
        b[i] = a[i]*2;
    }

Unrolling this by a factor of 4 produces this:

    int unroll = (n/4)*4;
    for (i = 0; i < unroll; i += 4) {
        b[i] = a[i]*2;
        b[i+1] = a[i+1]*2;
        b[i+2] = a[i+2]*2;
        b[i+3] = a[i+3]*2;
    }
    for (; i < n; i++) {
        b[i] = a[i]*2;
    }

Now, is the SSE translation equivalent:

    __m128 ai_v = _mm_loadu_ps(&a[i]);
    __m128 two_v = _mm_set1_ps(2);
    _
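Completing the truncated translation, a full SSE version would look like the following sketch (times2 is a hypothetical name), with the constant hoisted out of the loop and the same scalar tail as in the unrolled version:

```cpp
#include <xmmintrin.h>  // SSE

void times2(const float* a, float* b, int n) {
    const __m128 two_v = _mm_set1_ps(2.0f);  // hoisted: set once, not per iteration
    int i = 0;
    for (; i + 4 <= n; i += 4) {             // 4 floats per iteration
        __m128 ai_v = _mm_loadu_ps(&a[i]);
        _mm_storeu_ps(&b[i], _mm_mul_ps(ai_v, two_v));
    }
    for (; i < n; ++i)                       // scalar tail for n % 4 leftovers
        b[i] = a[i] * 2;
}
```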