avx

__m256d TRANSPOSE4 Equivalent?

Submitted by 心已入冬 on 2019-12-03 20:14:55
Intel provides _MM_TRANSPOSE4_PS to transpose a 4x4 matrix of vectors. I want to do the equivalent with __m256d, but I can't figure out how to use _mm256_shuffle_pd in the same manner.

_MM_TRANSPOSE4_PS code:

#define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) { \
    __m128 tmp3, tmp2, tmp1, tmp0; \
    \
    tmp0 = _mm_shuffle_ps((row0), (row1), 0x44); \
    tmp2 = _mm_shuffle_ps((row0), (row1), 0xEE); \
    tmp1 = _mm_shuffle_ps((row2), (row3), 0x44); \
    tmp3 = _mm_shuffle_ps((row2), (row3), 0xEE); \
    \
    (row0) = _mm_shuffle_ps(tmp0, tmp1, 0x88); \
    (row1) = _mm_shuffle_ps(tmp0, tmp1, 0xDD); \
    (row2) = _mm_shuffle_ps(tmp2, tmp3, 0x88); \
    (row3) = _mm_shuffle_ps(tmp2, tmp3, 0xDD); \
}
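A double-precision 4x4 transpose can be built from _mm256_shuffle_pd (which shuffles within each 128-bit lane) plus _mm256_permute2f128_pd (which swaps the lanes). A minimal sketch, with illustrative names that are not from the original question:

```c
#include <immintrin.h>

// Transpose a 4x4 matrix of doubles held in four __m256d rows.
// row0 = {a0,a1,a2,a3}, row1 = {b0,...}, row2 = {c0,...}, row3 = {d0,...}
static inline void transpose4_pd(__m256d *row0, __m256d *row1,
                                 __m256d *row2, __m256d *row3)
{
    __m256d t0 = _mm256_shuffle_pd(*row0, *row1, 0x0); // a0 b0 a2 b2
    __m256d t1 = _mm256_shuffle_pd(*row0, *row1, 0xF); // a1 b1 a3 b3
    __m256d t2 = _mm256_shuffle_pd(*row2, *row3, 0x0); // c0 d0 c2 d2
    __m256d t3 = _mm256_shuffle_pd(*row2, *row3, 0xF); // c1 d1 c3 d3

    *row0 = _mm256_permute2f128_pd(t0, t2, 0x20);      // a0 b0 c0 d0
    *row1 = _mm256_permute2f128_pd(t1, t3, 0x20);      // a1 b1 c1 d1
    *row2 = _mm256_permute2f128_pd(t0, t2, 0x31);      // a2 b2 c2 d2
    *row3 = _mm256_permute2f128_pd(t1, t3, 0x31);      // a3 b3 c3 d3
}
```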

Optimal uint8_t bitmap into an 8 x 32-bit SIMD "bool" vector

Submitted by 爱⌒轻易说出口 on 2019-12-03 17:18:11
As part of a compression algorithm, I am looking for the optimal way to achieve the following: I have a simple bitmap in a uint8_t, for example 01010011. What I want is a __m256i of the form (0, maxint, 0, maxint, 0, 0, maxint, maxint).

One way to achieve this is to shuffle a vector of 8 x maxint into a vector of zeros, but that first requires me to expand my uint8_t to the right shuffle bitmap. I am wondering if there is a better way?

Here is a solution (PaulR improved my solution, see the end of my answer or his answer) based on a variation of this question: fastest-way-to-broadcast-32-bits
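One common AVX2 approach along these lines is to broadcast the byte to every lane, AND each lane with its own bit, and compare for equality. A minimal sketch, assuming AVX2 is available (the function name is illustrative, and lane 0 corresponds to the least-significant bit; reverse the constants for the opposite order):

```c
#include <immintrin.h>
#include <stdint.h>

// Expand an 8-bit mask into 8 x 32-bit lanes: all-ones where the bit is set, zero otherwise.
static inline __m256i bitmap_to_mask(uint8_t bitmap)
{
    __m256i v    = _mm256_set1_epi32(bitmap);                       // broadcast the byte to all lanes
    __m256i bits = _mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128);  // one bit per lane
    v = _mm256_and_si256(v, bits);                                  // isolate each lane's bit
    return _mm256_cmpeq_epi32(v, bits);                             // 0xFFFFFFFF iff the bit was set
}
```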

Half-precision floating-point arithmetic on Intel chips

Submitted by 拟墨画扇 on 2019-12-03 16:03:20
Is it possible to perform half-precision floating-point arithmetic on Intel chips? I know how to load/store/convert half-precision floating-point numbers [1], but I do not know how to add/multiply them without converting them to single-precision floating-point numbers.

[1] https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats

Is it possible to perform half-precision floating-point arithmetic on Intel chips? Yes, apparently the on-chip GPU in Skylake and later has hardware support for FP16 and FP64, as well as FP32. With new enough drivers you can use it via OpenCL.
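On the CPU cores themselves the usual route (before AVX-512 FP16) is the F16C extension: convert halves to single precision, do the arithmetic in FP32, and convert back. A minimal sketch, assuming an F16C-capable CPU and compiling with -mf16c; the array names are illustrative:

```c
#include <immintrin.h>
#include <stdint.h>

// Multiply 8 half-precision values by 8 others, producing half-precision results.
// F16C only provides conversions; the arithmetic itself happens in FP32.
static inline void mul_fp16x8(const uint16_t *a, const uint16_t *b, uint16_t *out)
{
    __m256 fa = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)a)); // fp16 -> fp32
    __m256 fb = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)b));
    __m256 fr = _mm256_mul_ps(fa, fb);                                // multiply in fp32
    _mm_storeu_si128((__m128i *)out,
                     _mm256_cvtps_ph(fr, _MM_FROUND_TO_NEAREST_INT)); // fp32 -> fp16
}
```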

Using AVX instructions disables exp() optimization?

Submitted by 陌路散爱 on 2019-12-03 14:42:26
I am writing a feed-forward net in VC++ using AVX intrinsics and invoking this code via P/Invoke from C#. When I call a function that runs a large loop including exp(), performance is ~1000 ms for a loop size of 160M. As soon as I call any function that uses AVX intrinsics and then subsequently use exp(), performance drops to about ~8000 ms for the same operation. Note that the function calculating exp() is standard C, and the call that uses the AVX intrinsics can be completely unrelated in terms of the data being processed. Some kind of flag is getting tripped somewhere
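The classic culprit for this pattern is the SSE/AVX transition penalty: after 256-bit AVX code executes, the upper halves of the YMM registers are "dirty", and every legacy-SSE instruction inside the library exp() pays a state-transition cost. A hedged sketch of the usual remedy (issue _mm256_zeroupper() after the AVX section, or compile everything with /arch:AVX or -mavx so VEX encodings are used throughout); the function and array names are illustrative:

```c
#include <immintrin.h>
#include <stddef.h>
#include <math.h>

void process(const double *in, double *out, size_t n)
{
    /* ... 256-bit AVX intrinsics work here ... */

    _mm256_zeroupper();        // clear the dirty upper YMM state before legacy-SSE code

    for (size_t i = 0; i < n; ++i)
        out[i] = exp(in[i]);   // scalar/SSE exp() no longer pays transition penalties
}
```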

Aligned and unaligned memory access with AVX/AVX2 intrinsics

Submitted by 柔情痞子 on 2019-12-03 12:24:48
According to Intel's Software Developer Manual (sec. 14.9), AVX relaxed the alignment requirements of memory accesses. If data is loaded directly in a processing instruction, e.g.

vaddps ymm0,ymm0,YMMWORD PTR [rax]

the load address doesn't have to be aligned. However, if a dedicated aligned load instruction is used, such as

vmovaps ymm0,YMMWORD PTR [rax]

the load address has to be aligned (to a multiple of 32), otherwise an exception is raised. What confuses me is the automatic code generation from intrinsics, in my case by gcc/g++ (4.6.3, Linux). Please have a look at the following test code:
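In intrinsics terms the distinction is _mm256_load_pd / _mm256_store_pd (aligned; may fault when emitted as vmovapd on a misaligned address) versus _mm256_loadu_pd / _mm256_storeu_pd (unaligned). A minimal sketch, with illustrative names, of guaranteeing the alignment yourself so the aligned forms are safe:

```c
#include <immintrin.h>
#include <stddef.h>

void add_arrays(size_t n)
{
    // 32-byte alignment makes the aligned load/store intrinsics safe to use.
    double *a = _mm_malloc(n * sizeof(double), 32);
    double *b = _mm_malloc(n * sizeof(double), 32);

    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m256d va = _mm256_load_pd(a + i);   // aligned load: address must be 32-byte aligned
        __m256d vb = _mm256_loadu_pd(b + i);  // unaligned load: any address is fine
        _mm256_store_pd(a + i, _mm256_add_pd(va, vb));
    }

    _mm_free(a);
    _mm_free(b);
}
```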

Are older SIMD versions available when using newer ones?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-03 11:58:59
When I can use SSE3 or AVX, are older SSE versions such as SSE2 or MMX also available, or do I still need to check for them separately? In general, these have been additive, but keep in mind that there are differences between Intel and AMD support for these over the years. If you have AVX, then you can assume SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2 as well. Remember that to use AVX you also need to validate that the OSXSAVE CPUID bit is set, to ensure the OS you are using actually supports saving the AVX registers as well. You should still explicitly check for all the CPUID support you use in your
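The AVX + OSXSAVE + XCR0 check described above can be sketched as follows for GCC/Clang (a minimal sketch; the helper name is illustrative, and _xgetbv may require a recent compiler and -mxsave):

```c
#include <cpuid.h>
#include <immintrin.h>
#include <stdbool.h>

// Returns true only if both the CPU and the OS support AVX.
static bool avx_supported(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    bool has_avx     = ecx & (1u << 28);   // CPUID.1:ECX.AVX
    bool has_osxsave = ecx & (1u << 27);   // CPUID.1:ECX.OSXSAVE
    if (!has_avx || !has_osxsave)
        return false;
    // XCR0 bits 1 (SSE state) and 2 (AVX state) must both be enabled by the OS.
    unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;
}
```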

Are static / static local SSE / AVX variables blocking an xmm / ymm register?

Submitted by 烈酒焚心 on 2019-12-03 08:42:33
When using SSE intrinsics, zero vectors are often required. One way to avoid creating a zero variable inside a function every time the function is called (each call effectively executing some vector xor instruction) would be to use a static local variable, as in

static inline __m128i negate(__m128i a) {
    static __m128i zero = _mm_setzero_si128();
    return _mm_sub_epi16(zero, a);
}

It seems the variable is only initialized when the function is called for the first time. (I checked this by calling a true function instead of the _mm_setzero_si128() intrinsic.) It only seems to be possible in C++, not in
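In practice the static is unnecessary: _mm_setzero_si128() is a zero idiom that compilers turn into at most one dependency-breaking pxor/vpxor (and often fold away entirely), whereas a static local in C++ adds a guard-variable check on every call. A minimal sketch of simply materializing the zero inline:

```c
#include <immintrin.h>

static inline __m128i negate(__m128i a)
{
    // The zero is "recreated" each call, but it costs at most one xor
    // and is frequently optimized out completely.
    return _mm_sub_epi16(_mm_setzero_si128(), a);
}
```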

Using __m256d registers

Submitted by ☆樱花仙子☆ on 2019-12-03 07:43:23
How do you use __m256d? Say I want to use the Intel AVX instruction _mm256_add_pd on a simple Vector3 class with three 64-bit double-precision components (x, y, and z). What is the correct way to use this? Since x, y and z are members of the Vector3 class, can I declare them in a union with an __m256d variable?

union Vector3 {
    struct { double x, y, z; };
    __m256d _register; // the Intel register?
};

Then can I go:

Vector3 add( const Vector3& o ) {
    Vector3 result;
    result._register = _mm256_add_pd( _register, o._register ); // add 'em
    return result;
}

Is that going to work? Or do I need to
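One hedged way to make such a union work (a sketch, not necessarily the ideal layout) is to pad the struct to four doubles so it matches the 32-byte __m256d exactly; the __m256d member already forces 32-byte alignment of the union. The names below are illustrative:

```c
#include <immintrin.h>

typedef union {
    struct { double x, y, z, pad; };  // anonymous struct: C11, a common extension in C++
    __m256d reg;
} Vector3;

static inline Vector3 vec3_add(Vector3 a, Vector3 b)
{
    Vector3 r;
    r.reg = _mm256_add_pd(a.reg, b.reg);  // the unused fourth lane is added too; harmless
    return r;
}
```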

Scatter intrinsics in AVX

Submitted by 依然范特西╮ on 2019-12-03 05:33:20
I can't find them in the Intel Intrinsics Guide v2.7. Do you know if the AVX or AVX2 instruction sets support them? There are no scatter or gather instructions in the original AVX instruction set. AVX2 adds gather, but not scatter, instructions. AVX512F includes both scatter and gather instructions. AVX512PF additionally provides prefetch variants of gather and scatter instructions. AVX512CD provides instructions to detect conflicts in scatter addresses. Intel MIC (aka Xeon Phi, Knights Corner) does include gather and scatter instructions, but it is a separate coprocessor, and it cannot run normal
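For completeness, a hedged sketch of what AVX2's gather side looks like, and of emulating scatter with scalar stores (the only option before AVX-512F); the names are illustrative:

```c
#include <immintrin.h>

// Gather 8 floats from arbitrary 32-bit indices: result[i] = base[idx[i]] (AVX2).
static inline __m256 gather8(const float *base, const int *idx)
{
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    return _mm256_i32gather_ps(base, vidx, 4);   // scale = 4 bytes per element
}

// The scatter direction (base[idx[i]] = src[i]) has to be done element by element.
static inline void scatter8(float *base, const int *idx, __m256 src)
{
    float tmp[8];
    _mm256_storeu_ps(tmp, src);
    for (int i = 0; i < 8; ++i)
        base[idx[i]] = tmp[i];
}
```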

How to choose AVX compare predicate variants

Submitted by 本秂侑毒 on 2019-12-03 04:13:53
Question: In the Advanced Vector Extensions (AVX), for compare instructions like _mm256_cmp_ps, the last argument is a compare predicate. The choices for the predicate overwhelm me. They seem to be a triple of type, ordering, and signaling; e.g. _CMP_LE_OS is 'less than or equal, ordered, signaling'. For starters, is there a performance reason for selecting signaling or non-signaling, and similarly, is ordered or unordered faster than the other? And what does 'non-signaling' even mean? I can't find this in
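To make the naming concrete, a hedged example using the "ordered, quiet (non-signaling)" variant, which is the usual choice when you just want a mask and do not want quiet NaNs to raise the invalid-operation exception; the function name is illustrative:

```c
#include <immintrin.h>

// Returns a bitmask with bit i set where a[i] <= b[i].
// _CMP_LE_OQ: Less-or-Equal, Ordered (NaN compares false), Quiet (non-signaling on quiet NaN).
// _CMP_LE_OS is the same predicate but signals the invalid exception on any NaN input.
static inline int le_mask(__m256 a, __m256 b)
{
    __m256 cmp = _mm256_cmp_ps(a, b, _CMP_LE_OQ);
    return _mm256_movemask_ps(cmp);
}
```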