sse

Is it possible to vectorize this nested for with SSE?

徘徊边缘 submitted on 2019-12-24 03:01:07
Question: I've never written assembly code for SSE optimization, so sorry if this is a noob question. This article explains how to vectorize a for loop with a conditional statement. However, my code (taken from here) is of the form:

    for (int j = -halfHeight; j <= halfHeight; ++j) {
        for (int i = -halfWidth; i <= halfWidth; ++i) {
            const float rx = ofsx + j * a12;
            const float ry = ofsy + j * a22;
            float wx = rx + i * a11;
            float wy = ry + i * a21;
            const int x = (int) floor(wx);
            const int y = (int) floor(wy);
            if
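Under the assumption that the hot part is the coordinate computation and the floor() calls, here is a minimal SSE2-only sketch (the helper names wx_floor4 and floor_ps_epi32 are hypothetical) that processes four consecutive i values per step; floor is emulated by truncate-plus-correction and is valid while wx fits in the int32 range:

```cpp
#include <emmintrin.h>  // SSE2

// floor(x) for four floats at once, SSE2 only (no SSE4.1 _mm_floor_ps):
// truncate toward zero, then subtract 1 in lanes where truncation rounded
// up (i.e. negative non-integer inputs).
static inline __m128i floor_ps_epi32(__m128 x) {
    __m128i t  = _mm_cvttps_epi32(x);                    // truncate toward 0
    __m128  ft = _mm_cvtepi32_ps(t);
    __m128i up = _mm_castps_si128(_mm_cmpgt_ps(ft, x));  // -1 where trunc > x
    return _mm_add_epi32(t, up);                         // t - 1 in those lanes
}

// One vector step of:  wx = rx + i * a11;  x = (int)floor(wx);
void wx_floor4(float rx, float a11, int i, int out[4]) {
    __m128 iv = _mm_setr_ps((float)i, (float)(i + 1), (float)(i + 2), (float)(i + 3));
    __m128 wx = _mm_add_ps(_mm_set1_ps(rx), _mm_mul_ps(iv, _mm_set1_ps(a11)));
    _mm_storeu_si128((__m128i*)out, floor_ps_epi32(wx));
}
```

The wy/y computation is the same with ry and a21; the conditional body would then be handled with compare masks as in the linked article.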

how to optimise double dereferencing?

爷，独闯天下 submitted on 2019-12-24 00:34:02
Question: A very specific optimisation task. I have 3 arrays: const char* inputTape, const int* inputOffset (organised in groups of four), and char* outputTape. I must assemble the output tape from the input according to the following 5 operations:

    int selectorOffset = inputOffset[4*i];
    char selectorValue = inputTape[selectorOffset];
    int outputOffset = inputOffset[4*i+1+selectorValue];
    char outputValue = inputTape[outputOffset];
    outputTape[i] = outputValue; // store byte and then advance counter.

All
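Rewritten as a self-contained function (assemble is a hypothetical name), the five operations per output byte look like this; the key obstacle is that the second load's address depends on the byte fetched by the first load, a serial dependency that plain SIMD gathers cannot hide:

```cpp
// One output byte needs two dependent loads: inputTape[selectorOffset]
// yields selectorValue, which then selects which of the following offsets
// in the group of four to dereference.
void assemble(const char* inputTape, const int* inputOffset,
              char* outputTape, int count) {
    for (int i = 0; i < count; ++i) {
        int  selectorOffset = inputOffset[4 * i];
        char selectorValue  = inputTape[selectorOffset];              // load #1
        int  outputOffset   = inputOffset[4 * i + 1 + selectorValue];
        outputTape[i]       = inputTape[outputOffset];                // load #2
    }
}
```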

SSE ints vs. floats practice

时光毁灭记忆、已成空白 submitted on 2019-12-24 00:33:47
Question: When dealing with both ints and floats in SSE (AVX), is it good practice to convert all ints to floats and work only with floats? We need only a few SIMD instructions after that, and all we need are addition and compare instructions ( <, <=, == ), which, I hope, this conversion retains completely.

Answer 1: Expanding my comments into an answer. Basically you are weighing the following trade-off: Stick with integer: Integer SSE is low-latency, high-throughput. (dual issue on Sandy
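On the "retains completely" point: int32 values with magnitude at most 2^24 convert to float exactly, so <, <=, == give the same answers after conversion. A minimal SSE2 sketch of the compare side (less_mask is a hypothetical name):

```cpp
#include <emmintrin.h>  // SSE2

// Compare four int32 lanes as floats; returns a 4-bit mask with bit i set
// where a[i] < b[i]. Exact as long as the values stay within float's exact
// integer range (|x| <= 2^24), which is what makes the int->float strategy safe.
int less_mask(__m128i a, __m128i b) {
    __m128 fa = _mm_cvtepi32_ps(a);
    __m128 fb = _mm_cvtepi32_ps(b);
    return _mm_movemask_ps(_mm_cmplt_ps(fa, fb));
}
```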

Horizontal xor of two SSE values

邮差的信 submitted on 2019-12-23 22:19:37
Question: I need to do a horizontal xor of two 128-bit integers (by 32-bit integers) and combine the results into one 64-bit integer. So, an operation like this:

    uint32_t x0[4];
    uint32_t x1[4];
    uint32_t xor0 = x0[0];
    uint32_t xor1 = x1[0];
    for (int i = 1; i < 4; ++i) {
        xor0 ^= x0[i];
        xor1 ^= x1[i];
    }
    uint64_t result = uint64_t(xor1) << 32 | xor0;

I finally found the following code, which seems to work:

    __m128i x0 = ...;
    __m128i x1 = ...;
    __m128i xor64_0 = _mm_unpackhi_epi64(x0, x1);
    __m128i xor64_1 = _mm_unpacklo
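Completing the unpack idea into a full routine (hxor2 is a hypothetical name; SSE2 only): unpacking pairs the two inputs' 64-bit halves so one xor reduces both registers at once, and in-register shuffles finish each horizontal xor:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Returns (xor of x1's four 32-bit lanes) << 32 | (xor of x0's four lanes),
// matching the scalar reference loop above.
uint64_t hxor2(__m128i x0, __m128i x1) {
    __m128i hi = _mm_unpackhi_epi64(x0, x1); // {x0[2],x0[3], x1[2],x1[3]}
    __m128i lo = _mm_unpacklo_epi64(x0, x1); // {x0[0],x0[1], x1[0],x1[1]}
    __m128i t  = _mm_xor_si128(hi, lo);      // pairwise xors of both inputs
    __m128i s  = _mm_shuffle_epi32(t, _MM_SHUFFLE(2, 3, 0, 1)); // swap in pairs
    __m128i u  = _mm_xor_si128(t, s);        // lane0 = xor0, lane2 = xor1
    __m128i r  = _mm_shuffle_epi32(u, _MM_SHUFFLE(3, 1, 2, 0)); // pack lanes 0,2
    return (uint64_t)_mm_cvtsi128_si64(r);   // low 64 bits = xor1<<32 | xor0
}
```

_mm_cvtsi128_si64 requires an x86-64 target; on 32-bit x86 two _mm_cvtsi128_si32 extractions would be needed instead.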

best way to shuffle across AVX lanes?

ε祈祈猫儿з submitted on 2019-12-23 21:22:02
Question: There are questions with similar titles, but my question relates to one very specific use case not covered elsewhere. I have 4 __m128d registers (x0, x1, x2, x3) and I want to recombine their content into 5 __m256d registers (y0, y1, y2, y3, y4) as follows, in preparation for other calculations:

    on entry:
        x0 contains {a0, a1}
        x1 contains {a2, a3}
        x2 contains {a4, a5}
        x3 contains {a6, a7}
    on exit:
        y0 contains {a0, a1, a2, a3}
        y1 contains {a1, a2, a3, a4}
        y2 contains {a2, a3, a4, a5}
        y3 contains {a3
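Assuming y3 = {a3, a4, a5, a6} and y4 = {a4, a5, a6, a7} complete the pattern, one way is a sketch like the following (make_windows and its array interface are hypothetical; GCC/Clang's target attribute is used so it compiles without -mavx): build the even windows y0, y2, y4 with vinsertf128, then derive the odd windows with in-lane vshufpd, so no cross-lane shuffle is needed at all:

```cpp
#include <immintrin.h>

__attribute__((target("avx")))
void make_windows(const double in[8], double out[5][4]) {
    __m128d x0 = _mm_loadu_pd(in + 0); // {a0,a1}
    __m128d x1 = _mm_loadu_pd(in + 2); // {a2,a3}
    __m128d x2 = _mm_loadu_pd(in + 4); // {a4,a5}
    __m128d x3 = _mm_loadu_pd(in + 6); // {a6,a7}
    __m256d y0 = _mm256_insertf128_pd(_mm256_castpd128_pd256(x0), x1, 1); // {a0..a3}
    __m256d y2 = _mm256_insertf128_pd(_mm256_castpd128_pd256(x1), x2, 1); // {a2..a5}
    __m256d y4 = _mm256_insertf128_pd(_mm256_castpd128_pd256(x2), x3, 1); // {a4..a7}
    // vshufpd picks odd/even elements per 128-bit lane from two sources:
    __m256d y1 = _mm256_shuffle_pd(y0, y2, 0b0101); // {a1,a2,a3,a4}
    __m256d y3 = _mm256_shuffle_pd(y2, y4, 0b0101); // {a3,a4,a5,a6}
    _mm256_storeu_pd(out[0], y0); _mm256_storeu_pd(out[1], y1);
    _mm256_storeu_pd(out[2], y2); _mm256_storeu_pd(out[3], y3);
    _mm256_storeu_pd(out[4], y4);
}
```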

Fast vectorized conversion from RGB to BGRA

我的未来我决定 submitted on 2019-12-23 20:14:22
Question: In a follow-up to some previous questions on converting RGB to RGBA, and ARGB to BGR, I would like to speed up an RGB to BGRA conversion with SSE. Assume a 32-bit machine; I would like to use intrinsics. I'm having difficulty aligning both source and destination buffers to work with 128-bit registers, and am seeking other savvy vectorization solutions. The routine to be vectorized is as follows...

    void RGB8ToBGRX8(int w, const void *in, void *out) {
        int i;
        int width = w;
        const unsigned char
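One common approach (a sketch, not the accepted answer; rgb_to_bgrx and the 0xFF alpha fill are assumptions) uses SSSE3 pshufb with unaligned loads and stores, sidestepping the alignment problem entirely; each 16-byte load covers four 3-byte pixels:

```cpp
#include <tmmintrin.h>  // SSSE3
#include <cstdint>

__attribute__((target("ssse3")))
void rgb_to_bgrx(const uint8_t* in, uint8_t* out, int pixels) {
    // For 4 pixels, pick source bytes {2,1,0}, {5,4,3}, ... (RGB -> BGR) and
    // route 0x80 (pshufb's "emit zero" flag) into the X slot, filled via OR.
    const __m128i mask = _mm_setr_epi8(2, 1, 0, (char)0x80,  5,  4, 3, (char)0x80,
                                       8, 7, 6, (char)0x80, 11, 10, 9, (char)0x80);
    const __m128i alpha = _mm_set1_epi32((int)0xFF000000);
    int i = 0;
    for (; i + 5 < pixels; i += 4) {   // 16-byte load = 12 used + over-read guard
        __m128i rgb  = _mm_loadu_si128((const __m128i*)(in + 3 * i));
        __m128i bgrx = _mm_or_si128(_mm_shuffle_epi8(rgb, mask), alpha);
        _mm_storeu_si128((__m128i*)(out + 4 * i), bgrx);
    }
    for (; i < pixels; ++i) {          // scalar tail; also avoids reading past in[]
        out[4*i + 0] = in[3*i + 2];
        out[4*i + 1] = in[3*i + 1];
        out[4*i + 2] = in[3*i + 0];
        out[4*i + 3] = 0xFF;
    }
}
```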

Get an arbitrary float from a simd register at runtime?

落花浮王杯 submitted on 2019-12-23 20:10:44
Question: I want to access an arbitrary float from a SIMD register. I know that I can do things like:

    float get(const __m128& a, const int idx){
        // editor's note: this type-puns the FP bit-pattern to int and converts to float
        return _mm_extract_ps(a, idx);
    }

or

    float get(const __m128& a, const int idx){
        return _mm_cvtss_f32(_mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 0, 0, idx)));
    }

or even using a shift instead of a shuffle. The problem is that these all require idx to be known at compile time (shuffle, shift, and
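For a truly runtime-variable idx, the usual portable answer is to spill the register to memory and index it; a minimal sketch (get_lane is a hypothetical name):

```cpp
#include <xmmintrin.h>  // SSE

// Runtime-variable lane extraction: store the register to a small aligned
// array and index it; compilers lower this to a store plus a scalar load.
float get_lane(__m128 v, int idx) {
    alignas(16) float tmp[4];
    _mm_store_ps(tmp, v);
    return tmp[idx & 3];  // mask keeps the access in bounds
}
```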

Loop is not vectorized when variable extent is used

拥有回忆 submitted on 2019-12-23 17:13:38
Question: Version A of the code is not vectorized while version B is. How can I make version A vectorize while keeping the variable extents (without using literal extents)? The nested loop is a multiplication with broadcasting, as in the numpy library of Python and in MATLAB. A description of broadcasting in the numpy library is here. Version A code (no std::vector; no vectorization). This only uses imull (%rsi), %edx in .L169, which is not a SIMD instruction. gcc godbolt

    #include <iostream>
    #include <stdint.h>
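A common remedy (an assumption about the cause, not a quote from the answers): if the extents or pointers can be reloaded or can alias each other, the compiler cannot prove fixed trip counts or non-overlapping stores. Copying the extents into const locals and marking the pointers __restrict (a GCC/Clang/MSVC extension) often lets the broadcasting multiply vectorize; bcast_mul is a hypothetical name:

```cpp
// out[r][c] = a[r][c] * b[c]: b is broadcast along the rows, as in numpy.
void bcast_mul(float* __restrict out, const float* __restrict a,
               const float* __restrict b, int rows, int cols) {
    const int R = rows, C = cols;  // locals: provably loop-invariant extents
    for (int r = 0; r < R; ++r)
        for (int c = 0; c < C; ++c)
            out[r * C + c] = a[r * C + c] * b[c];
}
```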

Detecting SIMD instruction sets to be used with C++ Macros in Visual Studio 2015

帅比萌擦擦* submitted on 2019-12-23 17:00:41
Question: So, here is what I am trying to accomplish. In my C++ project, which has to be compiled with Microsoft Visual Studio 2015 or above, I need some code to have different versions depending on the newest SIMD instruction set available in the user's CPU, among: SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and AVX512. Since what I am looking for at this point is compile-time CPU dispatching, my first guess was that it could be easily accomplished using compiler macros. However,
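A sketch of what the macros can and cannot see (the MSVC side is based on its documented predefined macros, so treat it as an assumption to verify): MSVC only predefines __AVX__, __AVX2__ and __AVX512F__ via /arch, plus _M_IX86_FP on 32-bit x86; there is no /arch level between SSE2 and AVX, so SSE3 through SSE4.2 cannot be distinguished at compile time under MSVC, only under GCC/Clang (__SSE3__, __SSSE3__, __SSE4_1__, __SSE4_2__):

```cpp
// Highest instruction set the *compiler* is targeting (not the user's CPU):
#if defined(__AVX512F__)
  #define SIMD_LEVEL 9   // AVX-512 foundation
#elif defined(__AVX2__)
  #define SIMD_LEVEL 8
#elif defined(__AVX__)
  #define SIMD_LEVEL 7
#elif defined(__SSE4_2__)
  #define SIMD_LEVEL 6   // GCC/Clang only; MSVC never defines this
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
  #define SIMD_LEVEL 2   // SSE2 is baseline on x64
#elif defined(__SSE__) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #define SIMD_LEVEL 1
#else
  #define SIMD_LEVEL 0
#endif
```

Note this reflects the compile flags, not the machine the binary later runs on; picking a version for the user's actual CPU requires runtime dispatch via cpuid.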

SSE Intrinsics and loop unrolling

荒凉一梦 submitted on 2019-12-23 15:03:10
Question: I am attempting to optimise some loops, and I have managed to, but I wonder if I have only done it partially correctly. Say, for example, that I have this loop:

    for (i = 0; i < n; i++) {
        b[i] = a[i]*2;
    }

Unrolling this by a factor of 4 produces this:

    int unroll = (n/4)*4;
    for (i = 0; i < unroll; i += 4) {
        b[i] = a[i]*2;
        b[i+1] = a[i+1]*2;
        b[i+2] = a[i+2]*2;
        b[i+3] = a[i+3]*2;
    }
    for (; i < n; i++) {
        b[i] = a[i]*2;
    }

Now, is the SSE translation equivalent:

    __m128 ai_v = _mm_loadu_ps(&a[i]);
    __m128 two_v = _mm_set1_ps(2);
    _
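Completing the truncated translation, a full SSE version would look like the following sketch (times2 is a hypothetical name), with the constant hoisted out of the loop and the same scalar tail as in the unrolled version:

```cpp
#include <xmmintrin.h>  // SSE

void times2(const float* a, float* b, int n) {
    const __m128 two_v = _mm_set1_ps(2.0f);  // hoisted: set once, not per iteration
    int i = 0;
    for (; i + 4 <= n; i += 4) {             // 4 floats per iteration
        __m128 ai_v = _mm_loadu_ps(&a[i]);
        _mm_storeu_ps(&b[i], _mm_mul_ps(ai_v, two_v));
    }
    for (; i < n; ++i)                       // scalar tail for n % 4 leftovers
        b[i] = a[i] * 2;
}
```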