avx

Why does this simple C++ SIMD benchmark run slower when SIMD instructions are used?

陌路散爱 submitted on 2019-12-24 21:37:44
Question: I'm thinking about writing a SIMD vector math library, so as a quick benchmark I wrote a program that does 100 million (4-float) vector element-wise multiplications and adds them to a cumulative total. For my classic, non-SIMD variation I just made a struct with 4 floats and wrote my own multiply function "multiplyTwo" that multiplies two such structs element-wise and returns another struct. For my SIMD variation I used "immintrin.h" along with __m128, _mm_set_ps, and _mm_mul_ps. I'm running …
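A minimal sketch of the two variants being compared (names and bodies are assumptions based on the description above, not the asker's actual code):

    #include <immintrin.h>

    struct Vec4 { float x, y, z, w; };

    // Scalar baseline: element-wise multiply of two 4-float structs.
    static Vec4 multiplyTwo(const Vec4& a, const Vec4& b) {
        return { a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w };
    }

    // SSE variation: the same operation on a __m128.
    static __m128 multiplyTwoSimd(__m128 a, __m128 b) {
        return _mm_mul_ps(a, b);
    }

A common pitfall in benchmarks like this is that the SIMD version spends its time in _mm_set_ps, re-packing registers on every iteration, rather than in the multiply itself, which can easily make it slower than the scalar loop.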

The Effect of Architecture When Using SSE / AVX Intrinsics

只愿长相守 submitted on 2019-12-24 13:34:06
Question: I wonder how a compiler treats intrinsics. If one uses SSE2 intrinsics (via #include <emmintrin.h>) and compiles with the -mavx flag, what will the compiler generate: AVX or SSE code? If one uses AVX2 intrinsics (via #include <immintrin.h>) and compiles with the -msse2 flag, what will the compiler generate: SSE-only or AVX code? How do compilers treat intrinsics? If one uses intrinsics, does it help the compiler understand the dependency in the loop for …
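A small experiment that answers the first case (a sketch; behaviour observed with GCC and Clang, check your own compiler's output with -S or a disassembler):

    #include <emmintrin.h>

    // An SSE2 intrinsic. Compiled with -msse2 this becomes the legacy
    // "addps"; compiled with -mavx the same source is emitted as the
    // VEX-encoded "vaddps" (still 128-bit wide, but AVX encoding).
    __m128 add(__m128 a, __m128 b) {
        return _mm_add_ps(a, b);
    }

The reverse case fails instead of silently downgrading: AVX2 intrinsics in a translation unit compiled with only -msse2 are rejected by GCC/Clang with a target-feature mismatch error.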

Compile with AVX2 support and run

↘锁芯ラ submitted on 2019-12-24 12:27:34
Question: I have a very big library and I want to compile it with AVX2 support (but my processor supports only AVX). The library also has internal runtime checks for whether the processor supports AVX2 or not, something like this:

    #if __AVX2__
    if (support_avx2) {
        // vectorized code
    }
    #endif
    // simple C++ code

I was able to compile the library with AVX2 support, but when I run the tests I get, at the very beginning: Illegal instruction: 4. Any ideas? The goal is to compile the library with all available …
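The usual cause: compiling the whole translation unit with -mavx2 licenses the compiler to use AVX2 instructions anywhere in that unit, including on the always-executed path before the runtime check, so the guard never helps. A sketch of the standard structure (function names hypothetical): keep the AVX2 kernels in files compiled with -mavx2, and compile the dispatcher without it.

    // dispatcher.cpp -- built WITHOUT -mavx2
    void kernel_avx2(float* dst, const float* src, int n);   // defined in a -mavx2 TU
    void kernel_scalar(float* dst, const float* src, int n); // baseline build

    void kernel(float* dst, const float* src, int n) {
        if (__builtin_cpu_supports("avx2"))   // GCC/Clang runtime CPUID check
            kernel_avx2(dst, src, n);
        else
            kernel_scalar(dst, src, n);
    }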

reading/writing a matrix with a stride much larger than its width causes a big loss in performance

吃可爱长大的小学妹 submitted on 2019-12-24 11:59:25
Question: I'm doing dense matrix multiplication on 1024x1024 matrices. I do this with loop blocking/tiling using 64x64 tiles. I have created a highly optimized 64x64 matrix multiplication function (see the end of my question for the code): gemm64(float *a, float *b, float *c, int stride). Here is the code which runs over the tiles. A 1024x1024 matrix has 16x16 tiles.

    for(int i=0; i<16; i++) {
        for(int j=0; j<16; j++) {
            for(int k=0; k<16; k++) {
                gemm64(&a[64*(i*1024 + k)], &b[64*(k*1024 + j)], &c …
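The title's observation, that a stride much larger than the tile width hurts even when the data volume is the same, is typically addressed by packing: copy each tile into a contiguous scratch buffer once, so the inner kernel reads with stride 64 instead of 1024 and stops thrashing the TLB and cache-set associativity. A sketch of such a helper (hypothetical name, not the asker's code):

    #include <cstring>

    // Copy one 64x64 tile from a large-stride matrix into a contiguous
    // 64x64 buffer; the inner kernel then runs with stride 64.
    void pack_tile(float* dst, const float* src, int stride) {
        for (int r = 0; r < 64; ++r)
            std::memcpy(dst + 64 * r, src + (size_t)stride * r, 64 * sizeof(float));
    }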

Deinterleave and convert float to uint16_t efficiently

北城余情 submitted on 2019-12-24 11:36:43
Question: I need to deinterleave a packed image buffer (YUVA) of floats into planar buffers. I would also like to convert these floats to uint16_t, but this is really slow. My question is: how do I speed this up using intrinsics?

    void deinterleave(char* pixels, int rowBytes, char* bufferY,
                      char* bufferU, char* bufferV, char* bufferA) {
        // Scaling factors (note min. values are actually negative) (limited range)
        const float yuva_factors[4][2] = {
            { 0.07306f, 1.09132f }, // Y
            { 0.57143f, 0.57143f }, // …
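The float-to-uint16_t step on its own maps well to SSE4.1 (a sketch of just that conversion, with the question's scaling omitted):

    #include <smmintrin.h>
    #include <cstdint>

    // Convert 8 floats to 8 uint16_t with unsigned saturation.
    void floats_to_u16(const float* src, uint16_t* dst) {
        __m128i lo  = _mm_cvtps_epi32(_mm_loadu_ps(src));     // 4 floats -> int32
        __m128i hi  = _mm_cvtps_epi32(_mm_loadu_ps(src + 4));
        __m128i u16 = _mm_packus_epi32(lo, hi);               // int32 -> uint16 (saturating)
        _mm_storeu_si128((__m128i*)dst, u16);
    }

The deinterleave itself is then a matter of shuffling the four channel lanes apart before the store.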

how to optimise double dereferencing?

爷,独闯天下 submitted on 2019-12-24 00:34:02
Question: A very specific optimisation task. I have 3 arrays: const char* inputTape, const int* inputOffset (organised in groups of four), and char* outputTape, which I must assemble from the input according to the following 5 operations:

    int selectorOffset = inputOffset[4*i];
    char selectorValue = inputTape[selectorOffset];
    int outputOffset  = inputOffset[4*i + 1 + selectorValue];
    char outputValue  = inputTape[outputOffset];
    outputTape[i] = outputValue; // store byte and then advance counter.

All …
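The chain is two dependent indirect loads, which is what makes it hard; AVX2 gathers can at least fetch several first-level bytes per instruction. A sketch of that one step (assumes AVX2 and that eight selector offsets have already been collected into a contiguous array; note a gather may read a few bytes past the last offset, so keep padding at the end of the tape):

    #include <immintrin.h>

    // Gather 8 bytes from tape at 8 int32 offsets; each 32-bit gather
    // lane is masked down to its low byte.
    __m256i gather_bytes(const char* tape, const int* offsets) {
        __m256i idx = _mm256_loadu_si256((const __m256i*)offsets);
        __m256i raw = _mm256_i32gather_epi32((const int*)tape, idx, 1); // scale = 1 byte
        return _mm256_and_si256(raw, _mm256_set1_epi32(0xFF));
    }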

SSE ints vs. floats practice

时光毁灭记忆、已成空白 submitted on 2019-12-24 00:33:47
Question: When dealing with both ints and floats in SSE (AVX), is it good practice to convert all ints to floats and work only with floats? We need only a few SIMD instructions after that, and all we need to use is addition and the compare instructions (<, <=, ==), which I hope this conversion preserves completely. Answer 1: Expanding my comments into an answer. Basically you are weighing the following trade-off. Stick with integer: integer SSE is low-latency, high-throughput (dual issue on Sandy …
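For the compares themselves, both domains are a single instruction, so the cost question is really the conversions on either side (a sketch; note that int32 -> float conversion is exact only up to 2^24, beyond which distinct ints can compare equal as floats):

    #include <emmintrin.h>

    __m128i lt_int(__m128i a, __m128i b) {
        return _mm_cmplt_epi32(a, b);        // stay in the integer domain
    }

    __m128 lt_float(__m128i a, __m128i b) {
        __m128 fa = _mm_cvtepi32_ps(a);      // convert once...
        __m128 fb = _mm_cvtepi32_ps(b);
        return _mm_cmplt_ps(fa, fb);         // ...then compare as floats
    }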

Find 4 minimal values in 4 __m256d registers

北城以北 submitted on 2019-12-23 22:29:02
Question: I cannot figure out how to implement:

    __m256d min(__m256d A, __m256d B, __m256d C, __m256d D) {
        __m256d result;
        // result should contain the 4 minimal values out of the 16:
        // A[0], A[1], A[2], A[3], B[0], ..., D[3]
        // moreover it should be result[0] <= result[1] <= result[2] <= result[3]
        return result;
    }

Any ideas on how to use _mm256_min_pd, _mm256_max_pd and shuffles/permutes in a smart way?

==================================================

This is where I got so far, after:

    __m256d T = _mm256_min …
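Whatever min/max network one ends up with, it helps to have a trivially correct reference to validate it against (a sketch, not the intrinsics answer the question is after):

    #include <immintrin.h>
    #include <algorithm>

    __m256d min4of16_ref(__m256d A, __m256d B, __m256d C, __m256d D) {
        alignas(32) double v[16];
        _mm256_store_pd(v +  0, A);
        _mm256_store_pd(v +  4, B);
        _mm256_store_pd(v +  8, C);
        _mm256_store_pd(v + 12, D);
        std::partial_sort(v, v + 4, v + 16);   // 4 smallest, ascending
        return _mm256_load_pd(v);
    }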

best way to shuffle across AVX lanes?

ε祈祈猫儿з submitted on 2019-12-23 21:22:02
Question: There are questions with similar titles, but my question relates to one very specific use case not covered elsewhere. I have 4 __m128d registers (x0, x1, x2, x3) and I want to recombine their content into 5 __m256d registers (y0, y1, y2, y3, y4) as follows, in preparation for other calculations:

    on entry:
        x0 contains {a0, a1}
        x1 contains {a2, a3}
        x2 contains {a4, a5}
        x3 contains {a6, a7}
    on exit:
        y0 contains {a0, a1, a2, a3}
        y1 contains {a1, a2, a3, a4}
        y2 contains {a2, a3, a4, a5}
        y3 contains {a3 …
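One way to build this sliding window (a sketch; the question text above is truncated, so y3 and y4 are inferred as the natural continuation {a3,a4,a5,a6} and {a4,a5,a6,a7}): form the even windows by concatenating 128-bit halves, then derive the odd windows with an in-lane shuffle, which never has to cross a lane boundary.

    #include <immintrin.h>

    void windows(__m128d x0, __m128d x1, __m128d x2, __m128d x3,
                 __m256d& y0, __m256d& y1, __m256d& y2,
                 __m256d& y3, __m256d& y4) {
        y0 = _mm256_insertf128_pd(_mm256_castpd128_pd256(x0), x1, 1); // {a0,a1,a2,a3}
        y2 = _mm256_insertf128_pd(_mm256_castpd128_pd256(x1), x2, 1); // {a2,a3,a4,a5}
        y4 = _mm256_insertf128_pd(_mm256_castpd128_pd256(x2), x3, 1); // {a4,a5,a6,a7}
        y1 = _mm256_shuffle_pd(y0, y2, 0b0101);                       // {a1,a2,a3,a4}
        y3 = _mm256_shuffle_pd(y2, y4, 0b0101);                       // {a3,a4,a5,a6}
    }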

Get an arbitrary float from a simd register at runtime?

落花浮王杯 submitted on 2019-12-23 20:10:44
Question: I want to access an arbitrary float from a SIMD register. I know that I can do things like:

    float get(const __m128& a, const int idx) {
        // editor's note: this type-puns the FP bit-pattern to int and converts to float
        return _mm_extract_ps(a, idx);
    }

or

    float get(const __m128& a, const int idx) {
        return _mm_cvtss_f32(_mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 0, 0, idx)));
    }

or even using a shift instead of a shuffle. The problem is that these all require idx to be known at compile time (shuffle, shift, and …
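For a genuinely runtime idx, the usual answer is simply to spill the register to memory and index it (a sketch; compilers generally turn this into one store plus one scalar load):

    #include <immintrin.h>

    float get_runtime(__m128 a, int idx) {
        alignas(16) float buf[4];
        _mm_store_ps(buf, a);
        return buf[idx & 3];   // mask keeps the index in bounds
    }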