avx

Using SIMD/AVX/SSE for tree traversal

Submitted by 浪尽此生 on 2019-11-30 02:29:36
I am currently researching whether it is possible to speed up traversal of a van Emde Boas tree (or any tree). Given a single search query as input, and with multiple tree nodes already in the cache line (van Emde Boas layout), tree traversal seems to be instruction-bottlenecked. Being fairly new to SIMD/AVX/SSE instructions, I would like to know from experts on the topic whether it is possible to compare multiple nodes at once against a value and then find out which tree path to follow. My research led to the following question: how many CPU cycles/instructions are wasted on …
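Comparing several node keys against the search value at once is exactly what AVX2 compares allow. A minimal sketch of the idea, assuming AVX2, 8 sorted 32-bit keys per node, and an illustrative function name (none of this is from the question itself):

    #include <immintrin.h>
    #include <stdint.h>

    // Count how many of the node's 8 sorted keys are less than the search
    // key; that count is the index of the child to descend into.
    int child_index(const int32_t keys[8], int32_t search)
    {
        __m256i node = _mm256_loadu_si256((const __m256i *)keys);
        __m256i key  = _mm256_set1_epi32(search);
        __m256i gt   = _mm256_cmpgt_epi32(key, node);  // all-ones where search > keys[i]
        int mask     = _mm256_movemask_epi8(gt);       // 4 mask bits per 32-bit lane
        return _mm_popcnt_u32(mask) / 4;               // number of keys below search
    }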

Are different MMX, SSE and AVX versions complementary or supersets of each other?

Submitted by 冷暖自知 on 2019-11-30 01:02:54
I'm thinking I should familiarize myself with the x86 SIMD extensions, but before I even began I ran into trouble: I can't find a good overview of which of them are still relevant. The x86 architecture has accumulated a lot of math/multimedia extensions over the decades: MMX, 3DNow!, SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX-512. Did I forget something? Are the newer ones supersets of the older ones and vice versa, or are they complementary? Are some of them deprecated? Which of these are still relevant? I've heard references to "legacy SSE". Are some of them mutually exclusive? I.e. do they share the same …

FMA3 in GCC: how to enable

Submitted by 核能气质少年 on 2019-11-30 00:11:36
I have an i5-4250U, which has AVX2 and FMA3. I am testing some dense matrix multiplication code I wrote, with GCC 4.8.1 on Linux. Below are three different ways I compile:

    SSE2:     gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp
    AVX:      gcc matrix.cpp -o matrix_gcc -O3 -mavx -fopenmp
    AVX2+FMA: gcc matrix.cpp -o matrix_gcc -O3 -march=native -fopenmp -ffast-math

The SSE2 and AVX versions are clearly different in performance. However, the AVX2+FMA version is no better than the AVX version. I don't understand this. I get over 80% of the peak FLOPS of the CPU assuming there is no FMA, but I think I …
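For what it's worth, whether GCC actually emits FMA from a*b + c written in plain C depends on its contraction settings; the fused operation can also be requested explicitly with an intrinsic. A minimal sketch (assumes compiling with -mfma, which -march=native implies on an FMA-capable CPU):

    #include <immintrin.h>

    // Explicit fused multiply-add: a*b + c in a single instruction,
    // independent of whether the compiler contracts the expression itself.
    __m256 muladd(__m256 a, __m256 b, __m256 c)
    {
        return _mm256_fmadd_ps(a, b, c);
    }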

Intel AVX: 256-bit version of dot product for double precision floating point variables

Submitted by 我的未来我决定 on 2019-11-30 00:05:28
The Intel Advanced Vector Extensions (AVX) offer no dot product in the 256-bit version (YMM registers) for double-precision floating-point variables. The "Why?" question has been treated very briefly in another forum (here) and on Stack Overflow (here). But the question I am facing is how to replace this missing instruction with other AVX instructions in an efficient way. The dot product in the 256-bit version exists for single-precision floating-point variables (reference here): __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask); The idea is to find an efficient equivalent for this …
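One common replacement pattern, as a sketch (one possibility, not necessarily the fastest for every use; the function name is illustrative): multiply element-wise, then reduce with a horizontal add and a fold of the two 128-bit halves:

    #include <immintrin.h>

    double dot4(__m256d a, __m256d b)
    {
        __m256d p  = _mm256_mul_pd(a, b);          // a0*b0, a1*b1, a2*b2, a3*b3
        __m256d h  = _mm256_hadd_pd(p, p);         // p0+p1, p0+p1, p2+p3, p2+p3
        __m128d lo = _mm256_castpd256_pd128(h);    // p0+p1
        __m128d hi = _mm256_extractf128_pd(h, 1);  // p2+p3
        return _mm_cvtsd_f64(_mm_add_sd(lo, hi));  // (p0+p1) + (p2+p3)
    }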

How to divide a __m256i vector by an integer variable?

Submitted by 岁酱吖の on 2019-11-29 22:01:55
Question: I want to divide an AVX2 vector by a constant. I visited this question and many other pages. I saw something that might help, fixed-point arithmetic, but I didn't understand it. The problem is that this division is the bottleneck. I tried two ways. First, casting to float and doing the operation with AVX instructions:

    //outside the bottleneck:
    __m256i veci16; // containing some integer numbers (16x16-bit numbers)
    __m256 div_v = _mm256_set1_ps(div);
    //inside the bottleneck
    //some calculations which make …
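The fixed-point idea the question alludes to replaces the division with a multiply-high by a precomputed reciprocal. A sketch specialized to unsigned 16-bit lanes divided by 10 (the constant 0xCCCD and the extra shift by 3 are the standard magic values for this divisor; the function name is illustrative):

    #include <immintrin.h>

    // floor(x/10) for each unsigned 16-bit lane: (x * 0xCCCD) >> 19.
    // _mm256_mulhi_epu16 already provides the >> 16, so 3 shifts remain.
    __m256i div_by_10_epu16(__m256i x)
    {
        __m256i magic = _mm256_set1_epi16((short)0xCCCD);
        return _mm256_srli_epi16(_mm256_mulhi_epu16(x, magic), 3);
    }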

How to detect SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI availability at compile-time?

Submitted by 旧街凉风 on 2019-11-29 19:40:28
I'm trying to optimize some matrix computations, and I was wondering if it is possible to detect at compile time whether SSE/SSE2/AVX/AVX2/AVX-512/AVX-128-FMA/KCVI [1] is enabled by the compiler? Ideally for GCC and Clang, but I can manage with only one of them. I'm not sure it is possible, and perhaps I will use my own macro, but I'd prefer detecting it rather than asking the user to select it. [1] "KCVI" stands for Knights Corner Vector Instruction optimizations. Libraries like FFTW detect/utilize these newer instruction optimizations. Paul R: Most compilers will automatically define: __SSE__ …
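A minimal sketch of what that answer's approach looks like, using the macros GCC and Clang predefine when the corresponding ISA is enabled (via -msse2, -mavx2, -march=..., and so on):

    #if defined(__AVX512F__)
        #define SIMD_LEVEL "AVX-512F"
    #elif defined(__AVX2__)
        #define SIMD_LEVEL "AVX2"
    #elif defined(__AVX__)
        #define SIMD_LEVEL "AVX"
    #elif defined(__SSE2__)
        #define SIMD_LEVEL "SSE2"
    #elif defined(__SSE__)
        #define SIMD_LEVEL "SSE"
    #else
        #define SIMD_LEVEL "scalar"
    #endif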

Handling zeroes in _mm256_rsqrt_ps()

Submitted by 故事扮演 on 2019-11-29 17:01:56
Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing: _mm256_mul_ps(_mm256_rsqrt_ps(eightFloats), eightFloats); is the way to go for that extra bit of performance and avoiding a pipeline stall. Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions? _mm256_mul_ps(_mm256_rsqrt_ps(_mm256 …
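One common workaround, as a sketch (not necessarily the asker's final code): _mm256_rsqrt_ps(0) returns +inf, and inf * 0 yields NaN, so zero-input lanes can be forced back to zero with a compare mask and a bitwise AND:

    #include <immintrin.h>

    __m256 sqrt_via_rsqrt(__m256 x)
    {
        __m256 r    = _mm256_mul_ps(_mm256_rsqrt_ps(x), x);  // ~sqrt(x); NaN where x == 0
        __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(),
                                    _CMP_NEQ_OQ);            // all-ones where x != 0
        return _mm256_and_ps(r, mask);                       // zero out the NaN lanes
    }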

Most efficient way to get a __m256 of horizontal sums of 8 source __m256 vectors

Submitted by 别来无恙 on 2019-11-29 16:57:16
I know how to sum one __m256 to get a single summed value. However, I have 8 vectors:

    Input
    1: a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]
    .....,
    8: h[0], h[1], h[2], h[3], h[4], h[5], h[6], h[7]

    Output
    a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7], ...., h[0]+h[1]+h[2]+h[3]+h[4]+h[5]+h[6]+h[7]

My method; curious if there is a better way:

    __m256 sumab = _mm256_hadd_ps(accumulator1, accumulator2);
    __m256 sumcd = _mm256_hadd_ps(accumulator3, accumulator4);
    __m256 sumef = _mm256_hadd_ps(accumulator5, accumulator6);
    __m256 sumgh = _mm256_hadd_ps(accumulator7, accumulator8);
    __m256 sumabcd …
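The scrape cuts the snippet off; a plausible completion of the hadd chain, following the usual pattern for this reduction (two more pairwise stages, then one cross-lane fix-up):

    __m256 sumabcd = _mm256_hadd_ps(sumab, sumcd);  // lane0: aL bL cL dL, lane1: aH bH cH dH
    __m256 sumefgh = _mm256_hadd_ps(sumef, sumgh);  // lane0: eL fL gL hL, lane1: eH fH gH hH
    __m256 lo  = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x20);  // all low halves
    __m256 hi  = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x31);  // all high halves
    __m256 out = _mm256_add_ps(lo, hi);  // sum(a), sum(b), ..., sum(h) in order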

Reverse an AVX register containing doubles using a single AVX intrinsic

Submitted by 做~自己de王妃 on 2019-11-29 15:17:00
If I have an AVX register with 4 doubles in it, and I want to store the reverse of this in another register, is it possible to do it with a single intrinsic? For example, if I had 4 floats in an SSE register, I could use: _mm_shuffle_ps(A,A,_MM_SHUFFLE(0,1,2,3)); Can I do this using, maybe, _mm256_permute2f128_pd()? I don't think you can address each individual double using the above intrinsic. You actually need 2 permutes to do this: _mm256_permute2f128_pd() only permutes in 128-bit chunks, and _mm256_permute_pd() does not permute across 128-bit boundaries. So you need to use both: …
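As a sketch, the two-permute sequence the answer describes, plus the single-intrinsic alternative that AVX2 later added:

    #include <immintrin.h>

    __m256d reverse_pd(__m256d x)                       // x = d0 d1 d2 d3
    {
        __m256d s = _mm256_permute2f128_pd(x, x, 0x01); // d2 d3 d0 d1
        return _mm256_permute_pd(s, 0x5);               // d3 d2 d1 d0
    }

    // With AVX2, one cross-lane permute suffices:
    // _mm256_permute4x64_pd(x, _MM_SHUFFLE(0, 1, 2, 3));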

How many clock cycles does AVX/SSE exponentiation cost on a modern x86_64 CPU?

Submitted by 微笑、不失礼 on 2019-11-29 14:43:38
How many clock cycles does AVX/SSE exponentiation cost on a modern x86_64 CPU? I am asking about: pow(x, y) = exp(y*log(x)). I.e., do the exp() and log() AVX x86_64 instructions both require a certain known number of cycles? exp(): _mm256_exp_ps(); log(): _mm256_log_ps(). Or may the number of cycles vary depending on the exponent, and is there a maximum number of cycles exponentiation can cost? The x86 SIMD instruction set (i.e. not x87), at least up to AVX2, does not include SIMD exp, log, or pow, with the exception of pow(x, 0.5), which is the square root. There are SIMD math libraries, however, which are …
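The one case the excerpt notes the ISA does cover directly, shown as a sketch: pow(x, 0.5) maps straight onto the hardware square-root instruction:

    #include <immintrin.h>

    __m256 pow_half(__m256 x)
    {
        return _mm256_sqrt_ps(x);   // vsqrtps: pow(x, 0.5) in one instruction
    }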