simd | 易学教程

Is it possible to vectorize this nested for with SSE?

Question: I've never written assembly code for SSE optimization, so sorry if this is a noob question. This article explains how to vectorize a for loop with a conditional statement. However, my code (taken from here) is of the form: for (int j=-halfHeight; j<=halfHeight; ++j) { for(int i=-halfWidth; i<=halfWidth; ++i) { const float rx = ofsx + j * a12; const float ry = ofsy + j * a22; float wx = rx + i * a11; float wy = ry + i * a21; const int x = (int) floor(wx); const int y = (int) floor(wy); if

Avoiding invalid memory load with SIMD instructions

Question: I am loading elements from memory using SIMD load instructions, let's say using Altivec, assuming aligned addresses: float X[SIZE]; vector float V0; unsigned FLOAT_VEC_SIZE = sizeof(vector float); for (int load_index = 0; load_index < SIZE; load_index += FLOAT_VEC_SIZE) { V0 = vec_ld(load_index, X); /* some computation involving V0 */ } Now if SIZE is not a multiple of FLOAT_VEC_SIZE, it is possible that V0 contains some invalid memory elements in the last loop iteration. One way to avoid that is
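One widely used way to keep the last iteration from touching elements past the end of the array is to run the vector loop only over the complete groups and finish the leftover elements one at a time. The following is a portable C sketch of that pattern, not Altivec code: the inner lane loop stands in for one vector load plus one vector add, and the names `VEC_LANES` and `sum_with_tail` are hypothetical, chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical lane count: 4 floats per 128-bit vector. */
#define VEC_LANES 4

/* Sums n floats. The main loop only touches complete groups of VEC_LANES
   elements; the remaining 0..3 elements are handled by a scalar loop, so
   no access ever goes past x[n - 1]. */
float sum_with_tail(const float *x, size_t n)
{
    float acc[VEC_LANES] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t full = n - (n % VEC_LANES);  /* largest multiple of VEC_LANES <= n */
    size_t i;

    /* Stand-in for the vector loop: each outer iteration models one
       vector load and one vector add. */
    for (i = 0; i < full; i += VEC_LANES) {
        for (size_t lane = 0; lane < VEC_LANES; ++lane)
            acc[lane] += x[i + lane];
    }

    /* Horizontal reduction of the four partial sums. */
    float total = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    /* Scalar tail: the elements that do not fill a whole vector. */
    for (; i < n; ++i)
        total += x[i];

    return total;
}
```

With seven elements, the first four go through the vector-style loop and the last three through the scalar tail. Other approaches, such as padding the buffer to a multiple of the vector size, trade extra memory for a simpler loop.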

SSE ints vs. floats practice

Question: When dealing with both ints and floats in SSE (AVX), is it good practice to convert all ints to floats and work only with floats? We need only a few SIMD instructions after that, and all we need are addition and compare instructions (<, <=, ==), which this conversion, I hope, should retain completely. Answer 1: Expanding my comments into an answer. Basically you are weighing the following trade-off: Stick with integer: Integer SSE is low-latency, high throughput. (dual issue on Sandy
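Whether the conversion retains every value depends on the range of the integers: a 32-bit float has a 24-bit significand, so integers with magnitude above 2^24 are not all exactly representable. A minimal C sketch for checking a given value follows; the helper name `is_exact_as_float` is hypothetical and not taken from the question or the answer.

```c
#include <stdint.h>

/* Returns 1 if v is exactly representable as a 32-bit float, 0 otherwise.
   Both sides are widened to double, which represents every int32_t and
   every float exactly, so the comparison itself introduces no rounding. */
int is_exact_as_float(int32_t v)
{
    volatile float f = (float)v;  /* volatile forces rounding to float precision */
    return (double)f == (double)v;
}
```

For example, `is_exact_as_float(16777216)` returns 1, while `is_exact_as_float(16777217)` returns 0 because 2^24 + 1 rounds to 16777216.0f. If all values are known to stay within ±2^24, additions and comparisons on the converted floats give the same results as on the integers, provided intermediate sums also stay in that range.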

Find 4 minimal values in 4 __m256d registers

Question: I cannot figure out how to implement: __m256d min(__m256d A, __m256d B, __m256d C, __m256d D) { __m256d result; /* result should contain the 4 minimal values out of 16: A[0], A[1], A[2], A[3], B[0], ... , D[3]; moreover it should hold that result[0] <= result[1] <= result[2] <= result[3] */ return result; } Any ideas of how to use _mm256_min_pd, _mm256_max_pd and shuffles/permutes in a smart way? ================================================== This is where I got so far, after: __m256d T = _mm256_min

Complex data reorganization with vector instructions

Question: I need to load and rearrange 12 bytes into 16 (or 24 into 32) following the pattern below: ABC DEF GHI JKL becomes ABBC DEEF GHHI JKKL. Can you suggest efficient ways to achieve this using the SSE(2) and/or AVX(2) instructions? This needs to be performed repeatedly, so pre-stored masks or constants are allowed. Answer 1: By far your best bet is to use a byte shuffle (pshufb). Shifting within elements isn't enough by itself, since JKL has to move farther to the right than DEF, etc. So you
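To make the byte-shuffle idea concrete, here is a scalar C model of what a 16-byte shuffle with a fixed index table computes for this pattern. It is a sketch, not the answer's code: the names `EXPAND_IDX` and `expand_12_to_16` are hypothetical, and the loop only models the data movement that a single pshufb would perform in one instruction.

```c
#include <stdint.h>

/* Index table: output byte k is copied from input byte EXPAND_IDX[k].
   Each group of 3 input bytes (x, y, z) becomes 4 output bytes (x, y, y, z). */
static const uint8_t EXPAND_IDX[16] = {
    0, 1, 1, 2,    /* ABC -> ABBC */
    3, 4, 4, 5,    /* DEF -> DEEF */
    6, 7, 7, 8,    /* GHI -> GHHI */
    9, 10, 10, 11  /* JKL -> JKKL */
};

/* Scalar model of a 16-byte shuffle: gathers bytes from the 12-byte input
   according to the index table above. */
void expand_12_to_16(const uint8_t in[12], uint8_t out[16])
{
    for (int k = 0; k < 16; ++k)
        out[k] = in[EXPAND_IDX[k]];
}
```

With SSSE3 available, the same 16 indices could serve as the control vector for `_mm_shuffle_epi8`, which is the intrinsic for pshufb. Note that a 16-byte vector load from a 12-byte source reads 4 bytes beyond the data, so the buffer layout has to allow for that.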

best way to shuffle across AVX lanes?

Question: There are questions with similar titles, but my question relates to one very specific use case not covered elsewhere. I have 4 __m128d registers (x0, x1, x2, x3) and I want to recombine their contents into 5 __m256d registers (y0, y1, y2, y3, y4) as follows, in preparation for other calculations: on entry: x0 contains {a0, a1} x1 contains {a2, a3} x2 contains {a4, a5} x3 contains {a6, a7} on exit: y0 contains {a0, a1, a2, a3} y1 contains {a1, a2, a3, a4} y2 contains {a2, a3, a4, a5} y3 contains {a3

Fast vectorized conversion from RGB to BGRA

Question: In a follow-up to some previous questions on converting RGB to RGBA, and ARGB to BGR, I would like to speed up an RGB to BGRA conversion with SSE. Assume a 32-bit machine; I would like to use intrinsics. I'm having difficulty aligning both source and destination buffers to work with 128-bit registers, and am looking for other savvy vectorization solutions. The routine to be vectorized is as follows... void RGB8ToBGRX8(int w, const void *in, void *out) { int i; int width = w; const unsigned char

Get an arbitrary float from a simd register at runtime?

Question: I want to access an arbitrary float from a SIMD register. I know that I can do things like: float get(const __m128i& a, const int idx){ /* editor's note: this type-puns the FP bit-pattern to int and converts to float */ return _mm_extract_ps(a,idx); } or float get(const __m128i& a, const int idx){ return _mm_cvtss_f32(_mm_shuffle_ps(a,_MM_SHUFFLE(0,0,0,idx))); } or even using a shift instead of a shuffle. The problem is that these all require idx to be known at compile time (shuffle, shift, and
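One straightforward way to select a lane with an index that is only known at run time is to store the register to a small array and index that array. The sketch below assumes an x86 target with SSE and uses `__m128` rather than the `__m128i` from the question; the name `get_lane` is hypothetical. It is not necessarily the fastest option, only the simplest one that avoids compile-time immediates.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_storeu_ps */

/* Returns lane idx (0..3) of v. The index is masked with 3 so that an
   out-of-range value cannot read outside the temporary array. */
float get_lane(__m128 v, int idx)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v);  /* unaligned store: tmp[0] receives lane 0 */
    return tmp[idx & 3];
}
```

Given `v = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f)`, lane 0 holds 0.0f and lane 3 holds 3.0f, because `_mm_set_ps` takes its arguments from the highest lane down to the lowest.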

Why might this SIMD array-adding sample not be demonstrating any performance gains over a naive implementation?

Question: class Program { static void Main(string[] args) { Console.WriteLine(Vector.IsHardwareAccelerated ? "SIMD supported" : "SIMD not supported."); var rand = new Random(); var numNums = 10000000; var arr1 = Enumerable.Repeat(0, numNums).Select(x => (int) (rand.NextDouble() * 100)).ToArray(); var arr2 = Enumerable.Repeat(0, numNums).Select(x => (int) (rand.NextDouble() * 100)).ToArray(); var simdResult = new int [numNums]; var conventionalResult = new int [numNums]; var watch = System.Diagnostics

Using SIMD in a Game Engine Math Library by using function pointers ~ A good idea?

Question: I have been reading game engine books since I was 14 (at that time I didn't understand a thing :P). Now, quite some years later, I want to start programming the mathematical basis for my game engine. I've been thinking for a long time about how to design this 'library' (by which I mean an "organized set of files"). Every few years new SIMD instruction sets come out, and I wouldn't want them to go to waste. (Tell me if I am wrong about this.) I wanted to at least have the following properties: Making it able to