simd

Is it possible to vectorize this nested for with SSE?

徘徊边缘 submitted on 2019-12-24 03:01:07
Question: I've never written assembly code for SSE optimization, so sorry if this is a noob question. This article explains how to vectorize a for loop with a conditional statement. However, my code (taken from here) is of the form: for (int j=-halfHeight; j<=halfHeight; ++j) { for(int i=-halfWidth; i<=halfWidth; ++i) { const float rx = ofsx + j * a12; const float ry = ofsy + j * a22; float wx = rx + i * a11; float wy = ry + i * a21; const int x = (int) floor(wx); const int y = (int) floor(wy); if

Avoiding invalid memory load with SIMD instructions

ⅰ亾dé卋堺 submitted on 2019-12-24 02:33:37
Question: I am loading elements from memory using SIMD load instructions, let's say using Altivec, assuming aligned addresses: float X[SIZE]; vector float V0; unsigned FLOAT_VEC_SIZE = sizeof(vector float); for (int load_index =0; load_index < SIZE; load_index+=FLOAT_VEC_SIZE) { V0 = vec_ld(load_index, X); /* some computation involving V0*/ } Now if SIZE is not a multiple of FLOAT_VEC_SIZE, it is possible that V0 contains some invalid memory elements in the last loop iteration. One way to avoid that is
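A common way to keep the last iteration from touching elements past the end of the array is to run the vector loop only over whole vectors and finish the leftovers one element at a time. The sketch below illustrates that split in portable C; it is not necessarily the approach the truncated excerpt goes on to describe. The function name scale_all, the LANES constant, and the multiply-by-k computation are assumptions made for illustration, and the inner loop only stands in for a real vector load and operation such as vec_ld.

```c
#include <stddef.h>

/* Floats per 128-bit vector; stands in for the hardware vector width. */
#define LANES ((size_t)4)

/* Hypothetical example computation: scale every element of x in place. */
void scale_all(float *x, size_t n, float k)
{
    size_t full = n - (n % LANES);  /* elements covered by whole vectors */
    size_t i;

    /* Main loop: every iteration touches exactly LANES valid elements,
       so a real vector load would never see data past the array end. */
    for (i = 0; i < full; i += LANES) {
        for (size_t l = 0; l < LANES; ++l)
            x[i + l] *= k;
    }

    /* Remainder loop: the 0 to LANES-1 leftover elements, one at a time. */
    for (; i < n; ++i)
        x[i] *= k;
}
```

With n = 7, the main loop covers elements 0 to 3 and the remainder loop covers elements 4 to 6.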

SSE ints vs. floats practice

时光毁灭记忆、已成空白 submitted on 2019-12-24 00:33:47
Question: When dealing with both ints and floats in SSE (AVX), is it good practice to convert all ints to floats and work only with floats? We need only a few SIMD instructions after that, and all we need to use are addition and compare instructions ( <, <=, == ), which this conversion, I hope, should retain completely. Answer 1: Expanding my comments into an answer. Basically, you are weighing the following trade-off: Stick with integer: Integer SSE is low-latency, high-throughput. (dual issue on Sandy
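Whether the conversion "retains completely" depends on the magnitude of the integers. A 32-bit float has a 24-bit significand, so every integer with magnitude up to 2^24 (16777216) converts exactly, while larger values may be rounded. The scalar sketch below checks this for a single value. The helper name survives_float_roundtrip is an assumption made for illustration, and the sketch uses a plain C cast rather than a SIMD conversion instruction.

```c
#include <stdint.h>

/* Returns 1 if converting v to float and back reproduces v exactly,
   0 if the value was rounded on the way to float. */
int survives_float_roundtrip(int32_t v)
{
    /* volatile forces the value to be rounded to a real 32-bit float. */
    volatile float f = (float)v;

    /* Compare in 64 bits so a float equal to 2^31 cannot overflow. */
    return (int64_t)f == (int64_t)v;
}
```

For example, 16777216 survives the round trip but 16777217 does not, so comparisons such as == on floats can give different results from the integer comparison once values exceed that range.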

Find 4 minimal values in 4 __m256d registers

北城以北 submitted on 2019-12-23 22:29:02
Question: I cannot figure out how to implement: __m256d min(__m256d A, __m256d B, __m256d C, __m256d D) { __m256d result; // result should contain 4 minimal values out of 16 : A[0], A[1], A[2], A[3], B[0], ... , D[3] // moreover it should be result[0] <= result[1] <= result[2] <= result[3] return result; } Any ideas of how to use _mm256_min_pd, _mm256_max_pd and shuffles/permutes in a smart way? ================================================== This is where I got so far, after: __m256d T = _mm256_min

Complex data reorganization with vector instructions

蓝咒 submitted on 2019-12-23 22:06:59
Question: I need to load and rearrange 12 bytes into 16 (or 24 into 32) following the pattern below: ABC DEF GHI JKL becomes ABBC DEEF GHHI JKKL. Can you suggest efficient ways to achieve this using the SSE(2) and/or AVX(2) instructions? This needs to be performed repeatedly, so pre-stored masks or constants are allowed. Answer 1: By far your best bet is to use a byte shuffle (pshufb). Shifting within elements isn't enough by itself, since JKL has to move farther to the right than DEF, etc. So you
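As a sketch of the byte-shuffle idea for the 12-to-16 case: the mask below lists, for each output byte, the index of the input byte it copies, which reproduces the ABC DEF GHI JKL to ABBC DEEF GHHI JKKL pattern. The function name expand12to16, the TARGET_SSSE3 macro, and the padded temporary buffer are assumptions made for illustration. The copy into the buffer is only there so the example never reads past the 12 input bytes; code that knows 16 bytes are readable could load directly.

```c
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Lets GCC/Clang compile this one function for SSSE3 without -mssse3. */
#if defined(__GNUC__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

/* Expands 12 bytes "ABC DEF GHI JKL" into 16 bytes "ABBC DEEF GHHI JKKL". */
TARGET_SSSE3
void expand12to16(const uint8_t src[12], uint8_t dst[16])
{
    uint8_t padded[16] = {0};
    memcpy(padded, src, 12);  /* avoid reading beyond the 12 input bytes */

    /* Output byte i takes input byte mask[i]; the middle byte of each
       3-byte group is listed twice. */
    const __m128i mask = _mm_setr_epi8(0, 1, 1, 2,   3, 4, 4, 5,
                                       6, 7, 7, 8,   9, 10, 10, 11);

    __m128i v = _mm_loadu_si128((const __m128i *)padded);
    _mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(v, mask));
}
```

Because the mask is a constant, it can be built once and reused across calls, which matches the question's allowance for pre-stored masks.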

best way to shuffle across AVX lanes?

ε祈祈猫儿з submitted on 2019-12-23 21:22:02
Question: There are questions with similar titles, but my question relates to one very specific use case not covered elsewhere. I have 4 __m128d registers (x0, x1, x2, x3) and I want to recombine their content in 5 __m256d registers (y0, y1, y2, y3, y4) as follows, in preparation for other calculations: on entry: x0 contains {a0, a1} x1 contains {a2, a3} x2 contains {a4, a5} x3 contains {a6, a7} on exit: y0 contains {a0, a1, a2, a3} y1 contains {a1, a2, a3, a4} y2 contains {a2, a3, a4, a5} y3 contains {a3

Fast vectorized conversion from RGB to BGRA

我的未来我决定 submitted on 2019-12-23 20:14:22
Question: In a follow-up to some previous questions on converting RGB to RGBA, and ARGB to BGR, I would like to speed up an RGB to BGRA conversion with SSE. Assume a 32-bit machine; I would like to use intrinsics. I'm having difficulty aligning both source and destination buffers to work with 128-bit registers, and am looking for other savvy vectorization solutions. The routine to be vectorized is as follows... void RGB8ToBGRX8(int w, const void *in, void *out) { int i; int width = w; const unsigned char

Get an arbitrary float from a simd register at runtime?

落花浮王杯 submitted on 2019-12-23 20:10:44
Question: I want to access an arbitrary float from a simd register. I know that I can do things like: float get(const __m128i& a, const int idx){ /* editor's note: this type-puns the FP bit-pattern to int and converts to float */ return _mm_extract_ps(a,idx); } or float get(const __m128i& a, const int idx){ return _mm_cvtss_f32(_mm_shuffle_ps(a,_MM_SHUFFLE(0,0,0,idx))); } or even using a shift instead of a shuffle. The problem is that these all require idx to be known at compile time (shuffle, shift, and
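One simple way to sidestep the compile-time-index restriction is to store the register to memory and index the resulting array at run time. The sketch below shows that approach under the assumption that the data lives in a __m128 (float) register rather than the __m128i used in the question's snippets. The function name get_lane is hypothetical, and this is not necessarily the fastest option.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_storeu_ps */

/* Returns lane idx (0..3) of v, where idx is only known at run time. */
float get_lane(__m128 v, int idx)
{
    float lanes[4];

    assert(idx >= 0 && idx < 4);  /* guard against out-of-range lanes */
    _mm_storeu_ps(lanes, v);      /* spill all four lanes to memory */
    return lanes[idx];            /* ordinary run-time array indexing */
}
```

The store and reload cost a round trip through memory, so this is mainly attractive when lane extraction is not on the hot path.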

Why might this SIMD array-adding sample not be demonstrating any performance gains over a naive implementation?

时光总嘲笑我的痴心妄想 submitted on 2019-12-23 20:09:26
Question: class Program { static void Main(string[] args) { Console.WriteLine(Vector.IsHardwareAccelerated ? "SIMD supported" : "SIMD not supported."); var rand = new Random(); var numNums = 10000000; var arr1 = Enumerable.Repeat(0, numNums).Select(x => (int) (rand.NextDouble() * 100)).ToArray(); var arr2 = Enumerable.Repeat(0, numNums).Select(x => (int) (rand.NextDouble() * 100)).ToArray(); var simdResult = new int [numNums]; var conventionalResult = new int [numNums]; var watch = System.Diagnostics

Using SIMD in a Game Engine Math Library by using function pointers ~ A good idea?

荒凉一梦 submitted on 2019-12-23 19:15:18
Question: I have been reading game engine books since I was 14 (at that time I didn't understand a thing :P). Now, quite some years later, I want to start programming the mathematical basis for my game engine. I've thought for a long time about how to design this 'library' (by which I mean an "organized set of files"). Every few years new SIMD instruction sets come out, and I wouldn't want them to go to waste. (Tell me if I am wrong about this.) I wanted to at least have the following properties: Making it able to
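To make the function-pointer idea from the title concrete, here is a minimal sketch of selecting an implementation once and calling it through a pointer. All names (vec_add_fn, vec_add_scalar, vec_add_wide, select_vec_add) and the have_simd flag are hypothetical, and vec_add_wide is a plain-C stand-in for a routine that would use intrinsics in a real library.

```c
#include <stddef.h>

/* Common signature shared by every implementation of the operation. */
typedef void (*vec_add_fn)(const float *a, const float *b, float *out, size_t n);

/* Plain scalar fallback. */
static void vec_add_scalar(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Stand-in for a SIMD build of the same routine: four elements per step,
   then a scalar tail. A real version would use intrinsics here. */
static void vec_add_wide(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Pick an implementation once, e.g. at start-up. have_simd is a
   hypothetical flag the caller would derive from CPU detection. */
vec_add_fn select_vec_add(int have_simd)
{
    return have_simd ? vec_add_wide : vec_add_scalar;
}
```

One design point to weigh: a call through a pointer generally cannot be inlined, so this pattern tends to suit routines that process whole arrays more than tiny per-vector operations.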