avx

Fast memory transpose with SSE, AVX, and OpenMP

孤人 posted on 2019-11-29 07:53:08
Question: I need a fast memory-transpose algorithm for my Gaussian convolution function in C/C++. What I do now is: convolve_1D, transpose, convolve_1D, transpose. It turns out that with this method the filter size has to be large (or larger than I expected), or the transpose takes longer than the convolution (e.g. for a 1920x1080 matrix the convolution takes the same time as the transpose at a filter size of 35). The current transpose algorithm I am using uses loop blocking/tiling along with SSE and …
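The loop-blocking the question mentions can be sketched in plain scalar C++; this is a minimal illustration, not the asker's actual code, and in the SSE version each 4x4 tile would typically be transposed in registers with `_MM_TRANSPOSE4_PS`. The block size of 16 is an assumption to tune for the cache.

```cpp
#include <cstddef>

// Cache-blocked out-of-place transpose: walk the matrix in block x block
// tiles so both src reads and dst writes stay within a few cache lines.
void transpose_blocked(const float* src, float* dst,
                       std::size_t rows, std::size_t cols,
                       std::size_t block = 16) {
    for (std::size_t i = 0; i < rows; i += block)
        for (std::size_t j = 0; j < cols; j += block)
            for (std::size_t bi = i; bi < i + block && bi < rows; ++bi)
                for (std::size_t bj = j; bj < j + block && bj < cols; ++bj)
                    dst[bj * rows + bi] = src[bi * cols + bj];
}
```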

How to write C++ code that the compiler can efficiently compile to SSE or AVX?

僤鯓⒐⒋嵵緔 posted on 2019-11-29 07:12:42
Let's say I have a function written in C++ that performs matrix-vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that to SIMD instructions, because it does not know at compile time the alignment of the passed pointer (SSE wanting 16-byte alignment, AVX 32-byte)? Or is the memory alignment of the data irrelevant for optimal SIMD code, with alignment only affecting cache performance? If alignment is important for the generated code, how can I …
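One common answer to this kind of question is to promise the alignment to the compiler. The sketch below uses the GCC/Clang extension `__builtin_assume_aligned` together with a caller-side `alignas(32)` allocation; the function name and the 32-byte figure (AVX register width) are illustrative assumptions.

```cpp
#include <cstddef>

// With the alignment promise, the auto-vectorizer can emit aligned SIMD
// loads/stores instead of a runtime alignment check or unaligned accesses.
void scale(float* data, std::size_t n, float s) {
    // GCC/Clang extension: assert that data is 32-byte aligned.
    float* p = static_cast<float*>(__builtin_assume_aligned(data, 32));
    for (std::size_t i = 0; i < n; ++i)
        p[i] *= s;   // now vectorizable with aligned accesses
}
```

The caller must actually provide aligned memory, e.g. `alignas(32) float buf[8];` or `std::aligned_alloc(32, bytes)`; lying to `__builtin_assume_aligned` is undefined behavior.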

Horizontal sum of 8 packed 32-bit floats

别等时光非礼了梦想. posted on 2019-11-29 05:16:32
If I have 8 packed 32-bit floating-point numbers ( __m256 ), what's the fastest way to extract the horizontal sum of all 8 elements? Similarly, how can I obtain the horizontal maximum and minimum? In other words, what's the best implementation for the following C++ functions? float sum(__m256 x); ///< returns the sum of all 8 elements float max(__m256 x); ///< returns the maximum of all 8 elements float min(__m256 x); ///< returns the minimum of all 8 elements Quickly jotted down here (and hence untested): float sum(__m256 x) { __m128 hi = _mm256_extractf128_ps(x, 1); __m128 lo = _mm256_extractf128 …
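A completed version of the sketch the excerpt cuts off might look as follows (this is one standard reduction pattern, not necessarily the answer's exact code). The per-function `target("avx")` attribute is a GCC/Clang feature that lets this compile without a global `-mavx` flag.

```cpp
#include <immintrin.h>

__attribute__((target("avx")))
float hsum256(__m256 x) {
    __m128 hi = _mm256_extractf128_ps(x, 1);      // upper 4 floats
    __m128 lo = _mm256_castps256_ps128(x);        // lower 4 floats (free)
    lo = _mm_add_ps(lo, hi);                      // 4 pairwise sums
    lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));   // fold upper half
    lo = _mm_add_ss(lo, _mm_shuffle_ps(lo, lo, 0x1)); // fold last pair
    return _mm_cvtss_f32(lo);
}

// Convenience wrapper for testing from plain code.
__attribute__((target("avx")))
float hsum_array8(const float* p) { return hsum256(_mm256_loadu_ps(p)); }
```

The horizontal max and min follow the same shape with `_mm_max_ps`/`_mm_min_ps` (and `_mm_max_ss`/`_mm_min_ss`) substituted for the adds.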

Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?

雨燕双飞 posted on 2019-11-29 03:41:50
I'm writing some AVX code and I need to load from potentially unaligned memory. I'm currently loading 4 doubles, so I would use the intrinsic _mm256_loadu_pd ; the code I've written is: __m256d d1 = _mm256_loadu_pd(vInOut + i*4); I then compiled with options -O3 -mavx -g and used objdump to get the assembly plus annotated source and lines ( objdump -S -M intel -l avx.obj ). When I look into the underlying assembly, I find the following: vmovupd xmm0,XMMWORD PTR [rsi+rax*1] vinsertf128 ymm0,ymm0,XMMWORD PTR [rsi+rax*1+0x10],0x1 I was expecting to see this: …
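The split is a tuning decision, not a correctness issue: with generic tuning GCC enables `-mavx256-split-unaligned-load`, which lowers one unaligned 256-bit load into a 128-bit `vmovupd` plus `vinsertf128` (faster on some older CPUs). Compiling with `-march=native`, `-mtune=intel`, or `-mno-avx256-split-unaligned-load` typically yields the single `vmovupd`. A minimal reproduction, with the function itself just a hypothetical example:

```cpp
#include <immintrin.h>

// Load 4 doubles from a potentially unaligned pointer and return their sum.
// Inspect the generated code for the _mm256_loadu_pd: under
// `-O3 -mavx` (generic tuning) it is split; under `-O3 -march=native`
// on a recent CPU it is a single vmovupd ymm.
__attribute__((target("avx")))
double sum4(const double* p) {
    __m256d v  = _mm256_loadu_pd(p);
    __m128d lo = _mm256_castpd256_pd128(v);
    __m128d hi = _mm256_extractf128_pd(v, 1);
    __m128d s  = _mm_add_pd(lo, hi);             // 2 pairwise sums
    s = _mm_add_sd(s, _mm_unpackhi_pd(s, s));    // fold last pair
    return _mm_cvtsd_f64(s);
}
```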

How to quickly count bits into separate bins in a series of ints on Sandy Bridge?

牧云@^-^@ posted on 2019-11-29 03:37:14
Update: Please read the code, it is NOT about counting bits in one int. Is it possible to improve the performance of the following code with some clever assembler? uint bit_counter[64]; void Count(uint64 bits) { bit_counter[0] += (bits >> 0) & 1; bit_counter[1] += (bits >> 1) & 1; // .. bit_counter[63] += (bits >> 63) & 1; } Count is in the innermost loop of my algorithm. Update: Architecture: x86-64, Sandy Bridge, so SSE4.2, AVX1 and older extensions can be used, but not AVX2 or BMI1/2. The bits variable has almost random bits (close to half zeros and half ones). Maybe you can do 8 at once, by taking 8 bits …
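The "8 at once" idea the excerpt hints at can be sketched in portable C++ before resorting to assembler: spread one byte's bits into the 8 byte lanes of a uint64_t, accumulate many inputs in packed byte counters, and flush to the 32-bit bins before the byte lanes can overflow. This is an illustrative sketch (the post's `uint`/`uint64` are assumed to be `uint32_t`/`uint64_t`), not the thread's accepted answer.

```cpp
#include <cstddef>
#include <cstdint>

// Spread the 8 bits of b into the 8 bytes of a uint64_t (each byte 0 or 1).
static inline uint64_t spread8(uint8_t b) {
    uint64_t m = (b * 0x0101010101010101ULL) & 0x8040201008040201ULL;
    return ((m + 0x7F7F7F7F7F7F7F7FULL) >> 7) & 0x0101010101010101ULL;
}

void count_bits(uint32_t bit_counter[64], const uint64_t* data, std::size_t n) {
    uint64_t acc[8] = {0};   // acc[k]: 8 packed byte counters for bits 8k..8k+7
    std::size_t pending = 0;
    for (std::size_t i = 0; i < n; ++i) {
        for (int k = 0; k < 8; ++k)
            acc[k] += spread8(static_cast<uint8_t>(data[i] >> (8 * k)));
        if (++pending == 255) {              // flush before a byte lane overflows
            for (int k = 0; k < 8; ++k)
                for (int j = 0; j < 8; ++j)
                    bit_counter[8 * k + j] += (acc[k] >> (8 * j)) & 0xFF;
            for (int k = 0; k < 8; ++k) acc[k] = 0;
            pending = 0;
        }
    }
    for (int k = 0; k < 8; ++k)              // final flush
        for (int j = 0; j < 8; ++j)
            bit_counter[8 * k + j] += (acc[k] >> (8 * j)) & 0xFF;
}
```

The same packed-byte-counter scheme maps directly onto 128-bit SSE registers (two registers of sixteen byte counters each), which fits the Sandy Bridge constraint of no AVX2.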

Reduction with OpenMP and SSE/AVX

▼魔方 西西 posted on 2019-11-29 02:33:36
I want to do a reduction on an array using OpenMP and SIMD. I read that a reduction in OpenMP is equivalent to: inline float sum_scalar_openmp2(const float a[], const size_t N) { float sum = 0.0f; #pragma omp parallel { float sum_private = 0.0f; #pragma omp for nowait for(int i=0; i<N; i++) { sum_private += a[i]; } #pragma omp atomic sum += sum_private; } return sum; } I got this idea from the following link: http://bisqwit.iki.fi/story/howto/openmp/#ReductionClause But atomic does not support complex operators either. What I did was replace atomic with critical and implemented the …
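For the plain sum case, the idiomatic route is to let OpenMP generate the private-copy pattern itself via a reduction clause, and to let the compiler vectorize each thread's partial sum with the `simd` construct. A minimal sketch (compile with `-O3 -fopenmp`; without OpenMP the pragma is ignored and the loop runs serially but still correctly):

```cpp
#include <cstddef>

// OpenMP 4.0 combined construct: threads each keep a private partial sum
// (reduction clause) and each thread's loop chunk is SIMD-vectorized.
float sum_reduction(const float a[], std::size_t N) {
    float sum = 0.0f;
    #pragma omp parallel for simd reduction(+:sum)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(N); ++i)
        sum += a[i];
    return sum;
}
```

For user-defined "complex operators" that `atomic` cannot express, OpenMP 4.0's `declare reduction` is the standard alternative to a `critical` section.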

Using SIMD/AVX/SSE for tree traversal

不羁岁月 posted on 2019-11-28 23:26:42
Question: I am currently researching whether it would be possible to speed up a van Emde Boas (or any tree) tree traversal. Given a single search query as input, and already having multiple tree nodes in the cache line (van Emde Boas layout), tree traversal seems to be instruction-bottlenecked. Being fairly new to SIMD/AVX/SSE instructions, I would like to know from experts in the topic whether it would be possible to compare multiple nodes at once to a value and then find out which tree path to follow …
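Comparing several node keys at once is indeed the usual SIMD trick for wide tree nodes: compare the query against all keys, turn the compare result into a bitmask, and derive the child index from it. A sketch with SSE2 (baseline on x86-64), assuming a node of 4 sorted 32-bit keys; the layout and function name are illustrative assumptions.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Return which child to descend into: the number of keys smaller than the
// query, computed from one packed compare instead of a scalar scan.
int child_index(const int32_t keys[4], int32_t query) {
    __m128i k  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(keys));
    __m128i q  = _mm_set1_epi32(query);
    __m128i gt = _mm_cmpgt_epi32(q, k);   // all-ones in lanes where query > key
    int mask = _mm_movemask_ps(_mm_castsi128_ps(gt)); // one bit per lane
    return __builtin_popcount(mask);      // count of keys < query
}
```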

Sorting 64-bit structs using AVX?

天大地大妈咪最大 posted on 2019-11-28 23:22:15
Question: I have a 64-bit struct which represents several pieces of data, one of which is a floating-point value: struct MyStruct{ uint16_t a; uint16_t b; float f; }; and I have four of these structs in, let's say, a std::array<MyStruct, 4> . Is it possible to use AVX to sort the array in terms of the float member MyStruct::f ? Answer 1: Sorry this answer is messy; it didn't all get written at once and I'm lazy. There is some duplication. I have 4 separate ideas: normal sorting, but moving the struct as a …
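A scalar baseline helps pin down what any AVX sorting network must reproduce; the struct is exactly 64 bits, so each element can be moved as a single quadword. This sketch is only the reference ordering, not a vectorized solution:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

struct MyStruct { uint16_t a; uint16_t b; float f; };

// Reference: order the four 64-bit structs by the float member. An AVX
// approach would sort the same 64-bit quadwords with a sorting network;
// for non-negative f, putting the float's bits in the high 32 bits would
// even let unsigned 64-bit integer compares give the same order.
void sort_by_f(std::array<MyStruct, 4>& arr) {
    std::sort(arr.begin(), arr.end(),
              [](const MyStruct& x, const MyStruct& y) { return x.f < y.f; });
}
```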

FMA3 in GCC: how to enable

南楼画角 posted on 2019-11-28 21:13:10
Question: I have an i5-4250U, which has AVX2 and FMA3. I am testing some dense matrix multiplication code I wrote in GCC 4.8.1 on Linux. Below is a list of the three different ways I compile. SSE2: gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp AVX: gcc matrix.cpp -o matrix_gcc -O3 -mavx -fopenmp AVX2+FMA: gcc matrix.cpp -o matrix_gcc -O3 -march=native -fopenmp -ffast-math The SSE2 and AVX versions are clearly different in performance. However, the AVX2+FMA version is no better than the AVX version. I don't …
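The usual resolution to this kind of question is that FMA needs its own flag: `-mavx2` alone does not imply FMA3, so the code must be built with `-mfma` (or `-march=core-avx2` / `-march=native` on an FMA-capable machine), and GCC will only contract `a*b + c` into a fused instruction when contraction is allowed (e.g. with `-ffast-math` or `-ffp-contract=fast`). A tiny hedged demo; `std::fma` forces a single-rounding fused operation regardless of flags:

```cpp
#include <cmath>

// With -O3 -mfma, GCC compiles std::fma to one vfmadd instruction; without
// -mfma it falls back to a library call, but the result is the same.
double fma_demo(double a, double b, double c) {
    return std::fma(a, b, c);   // computes a*b + c with one rounding
}
```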

Parallel programming using Haswell architecture [closed]

旧城冷巷雨未停 posted on 2019-11-28 16:35:39
I want to learn about parallel programming using Intel's Haswell CPU microarchitecture, specifically using SIMD (SSE4.2, AVX2) from asm/C/C++/(any other languages). Can you recommend books, tutorials, internet resources, or courses? Thanks! Z boson: It sounds to me like you need to learn about parallel programming in general on the CPU. I started looking into this about 10 months ago, before I had ever used SSE, OpenMP, or intrinsics, so let me give a brief summary of some important concepts I have learned and some useful resources. There are several parallel computing technologies that can be employed: MIMD, SIMD, …