avx

Unexpectedly good performance with openmp parallel for loop

Submitted by ≡放荡痞女 on 2019-11-26 21:47:41
Question: I have edited my question after previous comments (especially @Zboson) for better readability. I have always acted on, and observed, the conventional wisdom that the number of OpenMP threads should roughly match the number of hyper-threads on a machine for optimal performance. However, I am observing odd behaviour on my new laptop with Intel Core i7 4960HQ, 4 cores - 8 threads. (See Intel docs here) Here is my test code: #include <math.h> #include <stdlib.h> #include <stdio.h> #include <omp.h>
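
For context, a minimal timing harness in the spirit of the question (a sketch: the workload, array size, and file name are placeholders, not the poster's actual benchmark). It sweeps the thread count with omp_set_num_threads and times a parallel for loop with omp_get_wtime.

    // Minimal sketch of a thread-count sweep (hypothetical workload, not the
    // poster's exact benchmark). Build with e.g.: g++ -O2 -fopenmp sweep.cpp
    #include <cmath>
    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main() {
        const int n = 1 << 24;
        std::vector<double> a(n);

        for (int threads = 1; threads <= 16; threads *= 2) {
            omp_set_num_threads(threads);      // overrides OMP_NUM_THREADS
            double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                a[i] = std::sin(i * 1e-3);     // stand-in per-element work
            double t1 = omp_get_wtime();
            std::printf("%2d threads: %.3f s (check %.3f)\n",
                        threads, t1 - t0, a[n / 2]);
        }
        return 0;
    }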

Using AVX CPU instructions: Poor performance without “/arch:AVX”

Submitted by 狂风中的少年 on 2019-11-26 21:31:33
My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX commands. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX. To use AVX, it is necessary to include this: #include "immintrin.h" and then you can use AVX intrinsic functions like _mm256_mul_ps, _mm256_add_ps etc. The problem is that by default, VS2010 produces code that works very slowly and shows the warning: warning C4752: found Intel(R) Advanced Vector Extensions; consider using /arch:AVX. It seems VS2010 actually does not use AVX
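
A common way to structure this (a hedged sketch, not the asker's code): keep the AVX path in its own translation unit compiled with /arch:AVX, call it only after a runtime CPU check, and end it with _mm256_zeroupper() so VEX-encoded and legacy-SSE code don't mix badly.

    // avx_path.cpp -- compile this file alone with /arch:AVX (VS) or -mavx (gcc/clang).
    // Hypothetical example using the intrinsics the question mentions.
    #include <immintrin.h>
    #include <cstddef>

    void mul_add_avx(float* dst, const float* a, const float* b, std::size_t n) {
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            __m256 r  = _mm256_add_ps(_mm256_mul_ps(va, vb), va);  // a*b + a
            _mm256_storeu_ps(dst + i, r);
        }
        for (; i < n; ++i)          // scalar tail
            dst[i] = a[i] * b[i] + a[i];
        _mm256_zeroupper();         // avoid SSE/AVX transition stalls on return to SSE code
    }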

What's missing/sub-optimal in this memcpy implementation?

Submitted by 拟墨画扇 on 2019-11-26 18:39:37
I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise of what I did and didn't think about, but here's some guy's implementation: __forceinline // Since Size is usually known, // most useless code will be optimized out // if the function is inlined. void* myMemcpy(char* Dst, const char* Src, size_t Size) { void* start = Dst; for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) ) { __m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++); _mm256_storeu_si256(((__m256i* &)Dst)++, ymm); } #define CPY_1B *((uint8_t * &)Dst)++ = *((const
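
Untangled, the quoted copy loop is essentially the following (a simplified sketch; the original continues with macro-based handling of the sub-32-byte tail):

    // Simplified sketch of the quoted 32-bytes-per-iteration copy loop.
    #include <immintrin.h>
    #include <cstddef>

    void* myMemcpy_sketch(void* Dst, const void* Src, std::size_t Size) {
        char* d = static_cast<char*>(Dst);
        const char* s = static_cast<const char*>(Src);
        void* start = Dst;
        for (; Size >= sizeof(__m256i); Size -= sizeof(__m256i)) {
            __m256i ymm = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s));
            _mm256_storeu_si256(reinterpret_cast<__m256i*>(d), ymm);
            s += sizeof(__m256i);
            d += sizeof(__m256i);
        }
        for (; Size > 0; --Size)    // the original handles this tail with CPY_* macros
            *d++ = *s++;
        return start;
    }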

Where is Clang's '_mm256_pow_ps' intrinsic?

Submitted by 回眸只為那壹抹淺笑 on 2019-11-26 18:35:11
Question: I can't seem to find the intrinsics for either _mm_pow_ps or _mm256_pow_ps, both of which are supposed to be included with 'immintrin.h'. Does Clang not define these, or are they in a header I'm not including? Answer 1: That's not an intrinsic; it's an Intel library function name that confusingly uses the same naming scheme as actual intrinsics. There's no vpowps instruction. (AVX512ER on Xeon Phi does have the semi-related vexp2ps instruction...) For functions like that and _mm_sin_ps to be usable
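
If SVML isn't available, one hedged workaround is a scalar fallback with the same shape as _mm256_pow_ps: spill the vectors, call std::pow per element, reload. Correct, but nowhere near the speed of a real vectorized exp/log implementation.

    // Hypothetical fallback with the signature of SVML's _mm256_pow_ps.
    #include <immintrin.h>
    #include <cmath>

    static inline __m256 pow_ps_fallback(__m256 x, __m256 y) {
        alignas(32) float xs[8], ys[8], rs[8];
        _mm256_store_ps(xs, x);
        _mm256_store_ps(ys, y);
        for (int i = 0; i < 8; ++i)
            rs[i] = std::pow(xs[i], ys[i]);   // scalar pow per element
        return _mm256_load_ps(rs);
    }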

Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all

Submitted by 落花浮王杯 on 2019-11-26 17:51:29
gcc 5.3 with -O3 -mavx -mtune=haswell for x86-64 makes surprisingly bulky code to handle potentially-misaligned inputs for code like: // convenient simple example of compiler input // I'm not actually interested in this for any real program void floatmul(float *a) { for (int i=0; i<1024 ; i++) a[i] *= 2; } clang uses unaligned load/store instructions, but gcc does a scalar intro/outro and an aligned vector loop: It peels off the first up-to-7 unaligned iterations, fully unrolling that into a sequence of vmovss xmm0, DWORD PTR [rdi] vaddss xmm0, xmm0, xmm0 ; multiply by two vmovss DWORD PTR
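
One way to build the mask the title asks about (a sketch under the assumption that 'skip' counts the leading float lanes to leave untouched): take an unaligned load from a sliding window over a 0 / -1 constant table.

    // Hypothetical mask generation for vmaskmovps from a misalignment count.
    #include <immintrin.h>

    static const int mask_src[16] = {
        0, 0, 0, 0, 0, 0, 0, 0,            // lanes to skip: MSB clear
        -1, -1, -1, -1, -1, -1, -1, -1     // lanes to process: MSB set
    };

    static inline __m256i head_mask(int skip) {   // skip in 0..8
        return _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(mask_src + 8 - skip));
    }

    // e.g. _mm256_maskstore_ps(p, head_mask(skip), v); touches only the unmasked lanes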

Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?

Submitted by 丶灬走出姿态 on 2019-11-26 17:49:29
Question: AMD CPUs handle 256b AVX instructions by decoding into two 128b operations. e.g. vaddps ymm0, ymm1,ymm1 on AMD Steamroller decodes to 2 macro-ops, with half the throughput of vaddps xmm0, xmm1,xmm1. XOR-zeroing is a special case (no input dependency, and on Jaguar at least avoids consuming a physical register file entry, and enables movdqa from that register to be eliminated at issue/rename, like Bulldozer does all the time even for non-zeroed regs). But is it detected early enough that
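
For reference, the idiom in question (a minimal sketch): compilers generally zero a ymm register with a 128-bit vxorps, since any VEX-encoded write to an xmm register zeros the upper half of the ymm anyway.

    #include <immintrin.h>

    __m256 zero_ymm() {
        return _mm256_setzero_ps();   // typically compiled to: vxorps xmm0, xmm0, xmm0
    }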

Disable AVX-optimized functions in glibc (LD_HWCAP_MASK, /etc/ld.so.nohwcap) for valgrind & gdb record

Submitted by 一世执手 on 2019-11-26 17:47:27
Question: Modern x86_64 Linux with glibc will detect that the CPU supports the AVX extension and will switch many string functions from the generic implementation to an AVX-optimized version (with the help of ifunc dispatchers: 1, 2). This feature can be good for performance, but it prevents several tools like valgrind (older libVEXs, before valgrind-3.8) and gdb's "target record" (reverse execution) from working correctly (Ubuntu "Z" 17.04 beta, gdb 7.12.50.20170207-0ubuntu2, gcc 6.3.0-8ubuntu1 20170221, Ubuntu
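
The dispatch mechanism itself can be illustrated with a small sketch (made-up function names, not glibc's actual resolvers, which consult glibc's own cpu-features data): an ifunc resolver runs once when the symbol is bound and picks an implementation from the CPU features it sees, which is why masking the advertised hardware capabilities can change which version gets bound.

    // Minimal illustration of ifunc-style dispatch (hypothetical names; GCC/Clang
    // on Linux with binutils support for STT_GNU_IFUNC).
    #include <cstddef>

    static void* copy_generic(void* dst, const void* src, std::size_t n) {
        char* d = static_cast<char*>(dst);
        const char* s = static_cast<const char*>(src);
        while (n--) *d++ = *s++;
        return dst;
    }

    static void* copy_avx(void* dst, const void* src, std::size_t n) {
        return copy_generic(dst, src, n);   // stand-in; a real one would use AVX
    }

    // Resolver: runs once at symbol-binding time, before the first call.
    extern "C" void* (*resolve_copy(void))(void*, const void*, std::size_t) {
        __builtin_cpu_init();
        return __builtin_cpu_supports("avx") ? copy_avx : copy_generic;
    }

    void* my_copy(void* dst, const void* src, std::size_t n)
        __attribute__((ifunc("resolve_copy")));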

How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?

Submitted by 女生的网名这么多〃 on 2019-11-26 17:43:35
The intrinsic: int mask = _mm256_movemask_epi8(__m256i s1) creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2 for example) I would like to perform the inverse of _mm256_movemask_epi8, i.e., create a __m256i vector with the most significant bit of each byte containing the corresponding bit of the uint32_t mask. What is the best way to do this? Edit: I need to perform the inverse because the intrinsic _mm256_blendv_epi8 accepts only an __m256i-type mask instead of uint32_t. As such, in the
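
One widely used approach (a sketch, not claimed to be optimal): broadcast the 32-bit mask to every lane, shuffle so each byte holds the mask byte its bit lives in, AND with a per-byte bit selector, then compare for equality. The result has 0x00/0xFF bytes, which is exactly the form _mm256_blendv_epi8 wants.

    // Sketch: expand a 32-bit mask to 32 bytes of 0x00 / 0xFF (AVX2).
    #include <immintrin.h>
    #include <cstdint>

    static inline __m256i inverse_movemask_epi8(std::uint32_t mask) {
        __m256i v = _mm256_set1_epi32(static_cast<int>(mask));
        // byte i of the result should look at mask byte i/8 (shuffle is per 128-bit lane)
        const __m256i sel = _mm256_setr_epi64x(
            0x0000000000000000LL, 0x0101010101010101LL,
            0x0202020202020202LL, 0x0303030303030303LL);
        v = _mm256_shuffle_epi8(v, sel);
        // within each group of 8 bytes, byte j tests bit j: 0x01, 0x02, ..., 0x80
        const __m256i bit = _mm256_set1_epi64x(
            static_cast<long long>(0x8040201008040201ULL));
        return _mm256_cmpeq_epi8(_mm256_and_si256(v, bit), bit);
    }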

Fastest way to unpack 32 bits to a 32 byte SIMD vector

Submitted by ε祈祈猫儿з on 2019-11-26 17:32:57
Question: Having 32 bits stored in a uint32_t in memory, what's the fastest way to unpack each bit to a separate byte element of an AVX register? The bits can be in any position within their respective byte. Edit: to clarify, I mean bit 0 goes to byte 0, bit 1 to byte 1. Obviously all other bits within the byte are zero. The best I could do at the moment is 2 PSHUFBs and a mask register for each position. If the uint32_t is a bitmap, then the corresponding vector elements should be 0 or non-0. (i.e. so
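
The broadcast-shuffle-AND trick from the movemask question above answers this one as well (a sketch; since only 0 vs non-0 is required here, the final compare can be dropped).

    // Sketch: bit i of 'bits' -> byte i of the result, as 0 / non-zero (AVX2).
    #include <immintrin.h>
    #include <cstdint>

    static inline __m256i bits_to_bytes(std::uint32_t bits) {
        __m256i v = _mm256_set1_epi32(static_cast<int>(bits));
        const __m256i sel = _mm256_setr_epi64x(
            0x0000000000000000LL, 0x0101010101010101LL,
            0x0202020202020202LL, 0x0303030303030303LL);
        v = _mm256_shuffle_epi8(v, sel);           // byte i now holds mask byte i/8
        const __m256i bit = _mm256_set1_epi64x(
            static_cast<long long>(0x8040201008040201ULL));
        return _mm256_and_si256(v, bit);           // non-zero exactly where bit i was set
        // add _mm256_cmpeq_epi8(..., bit) if 0x00/0xFF bytes are needed instead
    }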
