avx512

Why does AVX512-IFMA support only 52-bit ints?

只谈情不闲聊 posted on 2019-12-10 15:17:39
Question: From the value we can infer that it uses the same components as the double-precision floating-point hardware. But double has 53 bits of mantissa, so why is AVX512-IFMA limited to 52 bits?
Answer 1: IEEE-754 double precision stores only 52 mantissa bits explicitly; the 53rd bit (the most significant one) is an implicit 1.
Source: https://stackoverflow.com/questions/28862012/why-does-avx512-ifma-support-only-52-bit-ints
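The 52-versus-53 distinction can be checked directly on a double's bit pattern. The following is a minimal sketch, assuming the platform's double is IEEE-754 binary64; the helper names stored_fraction and biased_exponent are hypothetical and not part of any library.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Number of mantissa bits that binary64 stores explicitly. */
enum { STORED_FRACTION_BITS = 52 };

/* Return the 52 explicitly stored fraction bits of x (hypothetical helper). */
static inline uint64_t stored_fraction(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);     /* assumes a 64-bit double */
    memcpy(&bits, &x, sizeof bits);      /* portable bit reinterpretation */
    return bits & ((UINT64_C(1) << STORED_FRACTION_BITS) - 1);
}

/* Return the 11-bit biased exponent field of x (hypothetical helper). */
static inline unsigned biased_exponent(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (unsigned)((bits >> STORED_FRACTION_BITS) & 0x7FF);
}
```

For 1.0 the stored fraction is all zeros even though the value's leading 1 is part of the 53-bit precision that DBL_MANT_DIG reports; that leading bit is implied by the nonzero exponent field rather than stored.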

Choice between aligned vs. unaligned x86 SIMD instructions

早过忘川 posted on 2019-12-10 03:29:07
Question: There are generally two types of SIMD instructions:
A. Ones that work with aligned memory addresses and raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:
    movaps  xmm0, xmmword ptr [rax]
    vmovaps ymm0, ymmword ptr [rax]
    vmovaps zmm0, zmmword ptr [rax]
B. Ones that work with unaligned memory addresses and do not raise such an exception:
    movups  xmm0, xmmword ptr [rax]
    vmovups ymm0, ymmword ptr [rax]
    vmovups zmm0, zmmword ptr [rax]
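Whether an address satisfies the operand-size boundary that the aligned forms require can be tested with a bit mask. This is a small sketch only; is_aligned is a hypothetical helper, and the pointer-to-integer conversion is assumed to preserve the low address bits, as it does on mainstream x86 compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is a multiple of `alignment`, which must be a power of two:
   16 for xmm, 32 for ymm, 64 for zmm operands. Hypothetical helper. */
static inline int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

A caller could use such a check to decide between an aligned and an unaligned code path, for example is_aligned(ptr, 64) before a zmm-sized aligned access.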

Choice between aligned vs. unaligned x86 SIMD instructions

橙三吉。 posted on 2019-12-05 03:21:53
There are generally two types of SIMD instructions:
A. Ones that work with aligned memory addresses and raise a general-protection (#GP) exception if the address is not aligned on the operand-size boundary:
    movaps  xmm0, xmmword ptr [rax]
    vmovaps ymm0, ymmword ptr [rax]
    vmovaps zmm0, zmmword ptr [rax]
B. Ones that work with unaligned memory addresses and do not raise such an exception:
    movups  xmm0, xmmword ptr [rax]
    vmovups ymm0, ymmword ptr [rax]
    vmovups zmm0, zmmword ptr [rax]
But I'm just curious: why would I want to shoot myself in the foot and use aligned memory instructions

How do AVX512 rounding modes work (or is NDISASM simply confused)?

狂风中的少年 posted on 2019-12-01 23:10:29
Question: I'm trying to understand the specific AVX512F instruction vcvtps2udq. The signature of the instruction is VCVTPS2UDQ zmm1 {k1}{z}, zmm2/m512/m32bcst{er}. The manual info is below. In an attempt to understand the new rounding modes, I assembled the following code snippet with NASM (2.12.02):
    vcvtps2udq zmm0,zmm1
    vcvtps2udq zmm0,zmm1,{rz-sae}
    vcvtps2udq xmm0,xmm1
Disassembling the results with NDISASM (2.12.02) gives a lot of confusion and the following codes:
    62F17C4879C1 vcvtps2udq zmm0,zmm1
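To read encodings like the one above, it helps to split the fourth byte of the EVEX prefix into its fields. The sketch below assumes the standard EVEX layout (z in bit 7, L'L in bits 6:5, b in bit 4, V' in bit 3, aaa in bits 2:0); the struct and function names are hypothetical. When b is set on a register-to-register form that supports {er}, the L'L bits are reinterpreted as the rounding control rather than the vector length.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of the fourth EVEX prefix byte (after 62 and two payload bytes). */
struct evex_p2 {
    unsigned z;    /* bit 7: zeroing-masking */
    unsigned ll;   /* bits 6:5: vector length, or rounding control when b=1
                      on a register form that supports {er} */
    unsigned b;    /* bit 4: broadcast / embedded rounding / SAE */
    unsigned v;    /* bit 3: V', stored inverted */
    unsigned aaa;  /* bits 2:0: opmask register k0..k7 */
};

/* Decode the fourth prefix byte of an EVEX-encoded instruction
   (hypothetical helper; insn must point at the leading 0x62 byte). */
static inline struct evex_p2 decode_evex_p2(const uint8_t *insn)
{
    struct evex_p2 f;
    assert(insn[0] == 0x62);             /* EVEX escape byte */
    f.z   = (insn[3] >> 7) & 1u;
    f.ll  = (insn[3] >> 5) & 3u;
    f.b   = (insn[3] >> 4) & 1u;
    f.v   = (insn[3] >> 3) & 1u;
    f.aaa = insn[3] & 7u;
    return f;
}
```

For the bytes 62 F1 7C 48 79 C1 quoted above, the fourth byte 0x48 decodes to b = 0 and L'L = 2, i.e. a plain 512-bit operation. Under the same layout, a fourth byte of 0x78 would mean b = 1 with rounding control 3 (round toward zero); that value is derived from the field layout here, not quoted from the excerpt.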

When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?

微笑、不失礼 posted on 2019-12-01 17:27:54
When I do a writemasked AVX-512 store, like so:
    vmovdqu8 [rsi] {k1}, zmm0
will the instruction fault if some portion of the memory accessed at [rsi, rsi + 63] is not mapped but the writemask is zero for all those locations (i.e., the data is not actually modified due to the mask)? Another way of asking: do these AVX-512 masked stores have a fault-suppression ability similar to that of vmaskmov, introduced in AVX?
No fault is raised if masked-out elements touch invalid memory. Here's some Windows test code to prove that masking does indeed suppress memory faults.
#include <immintrin.h> #include
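The behavior described above can be pictured with a scalar model of a byte-masked 64-byte store. This is a portable sketch, not the Windows test code the answer refers to; masked_store_u8 is a hypothetical name. The model never touches a byte whose mask bit is clear, which mirrors why masked-out elements cannot fault.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 64-byte store under a 64-bit byte mask
   (hypothetical helper, not an intrinsic). */
static inline void masked_store_u8(uint8_t *dst, const uint8_t *src, uint64_t mask)
{
    assert(src != NULL);
    for (unsigned i = 0; i < 64; ++i) {
        if ((mask >> i) & 1u)
            dst[i] = src[i];   /* bytes with a clear mask bit are never accessed */
    }
}
```

With mask equal to 0 the loop body never dereferences dst, so in this model even an unusable destination pointer is harmless.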

When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements?

大兔子大兔子 posted on 2019-12-01 15:53:11
Question: When I do a writemasked AVX-512 store, like so:
    vmovdqu8 [rsi] {k1}, zmm0
will the instruction fault if some portion of the memory accessed at [rsi, rsi + 63] is not mapped but the writemask is zero for all those locations (i.e., the data is not actually modified due to the mask)? Another way of asking: do these AVX-512 masked stores have a fault-suppression ability similar to that of vmaskmov, introduced in AVX?
Answer 1: No fault is raised if masked-out elements touch invalid memory. Here's some

What is meant by “fixing up” floats?

自闭症网瘾萝莉.ら posted on 2019-12-01 04:43:27
I was looking through the AVX-512 instruction set and noticed a set of fixup instructions. Some examples:
    _mm512_fixupimm_pd, _mm512_mask_fixupimm_pd, _mm512_maskz_fixupimm_pd
    _mm512_fixupimm_round_pd, _mm512_mask_fixupimm_round_pd, _mm512_maskz_fixupimm_round_pd
What is meant here by "fixing up"?
That's a great question. Intel's answer (my bold) is here: This instruction is specifically intended for use in fixing up the results of arithmetic calculations involving one source so that they match the spec, although it is generally useful for fixing up the results of multiple-instruction
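As a rough illustration of "fixing up" a one-source result so that it matches the spec, consider a reciprocal computed by an approximation sequence that produces garbage for special inputs. The sketch below classifies the source operand and uses a small table to decide whether to keep the computed result or substitute the mandated value. It is only loosely modeled on the table-driven idea; the token and action names are hypothetical and do not reproduce the instruction's actual token set or immediate encoding.

```c
#include <assert.h>
#include <math.h>

enum token  { TOK_NAN, TOK_ZERO, TOK_INF, TOK_FINITE, TOK_COUNT };
enum action { KEEP_RESULT, MAKE_INF, MAKE_ZERO, MAKE_NAN };

/* What 1/x must be for each class of x (illustrative table). */
static const enum action recip_fixups[TOK_COUNT] = {
    [TOK_NAN]    = MAKE_NAN,
    [TOK_ZERO]   = MAKE_INF,
    [TOK_INF]    = MAKE_ZERO,
    [TOK_FINITE] = KEEP_RESULT
};

static inline enum token classify(double x)
{
    if (isnan(x)) return TOK_NAN;
    if (x == 0.0) return TOK_ZERO;       /* matches +0.0 and -0.0 */
    if (isinf(x)) return TOK_INF;
    return TOK_FINITE;
}

/* Patch an approximate reciprocal of x so special inputs give the
   spec-mandated answer; the sign follows the sign of x. */
static inline double fixup_recip(double x, double approx)
{
    enum token t = classify(x);
    assert(t < TOK_COUNT);               /* table index stays in range */
    switch (recip_fixups[t]) {
    case MAKE_INF:  return signbit(x) ? -INFINITY : INFINITY;
    case MAKE_ZERO: return signbit(x) ? -0.0 : 0.0;
    case MAKE_NAN:  return NAN;
    default:        return approx;
    }
}
```

For ordinary finite inputs the approximate result passes through unchanged; only the special classes are overridden.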

Fallback implementation for conflict detection in AVX2

喜你入骨 posted on 2019-12-01 03:31:29
AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a). It returns a vector in which, for every element of a, a bit is set for each preceding element that holds the same value. Is there a way to do something similar in AVX2? I'm not interested in the exact bits; I just need to know which elements are duplicates of the elements to their left (or right). I simply need to know if a scatter would conflict. Basically I need an AVX2 equivalent for __m256i detect_conflict(__m256i a) { __m256i cd = _mm256_conflict_epi32(a); return _mm256_cmpgt_epi32(cd, _mm256_set1_epi32(0)); } The only way I could think of is to use
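A scalar reference is useful for checking any vectorized fallback against the intended semantics. The sketch below is a plain C model, not an AVX2 implementation: for element i it sets bit j when an earlier element j (j < i) holds the same value. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of conflict detection over n 32-bit elements, n <= 32.
   out[i] has bit j set when j < i and a[j] == a[i]. */
static inline void conflict_u32_ref(const uint32_t *a, uint32_t *out, unsigned n)
{
    assert(n <= 32);                     /* one result bit per earlier element */
    for (unsigned i = 0; i < n; ++i) {
        uint32_t bits = 0;
        for (unsigned j = 0; j < i; ++j) {
            if (a[j] == a[i])
                bits |= UINT32_C(1) << j;
        }
        out[i] = bits;
    }
}

/* Nonzero if any element duplicates an earlier one, i.e. a scatter
   using these values as indices would conflict. */
static inline int any_conflict_u32(const uint32_t *a, unsigned n)
{
    uint32_t out[32];
    conflict_u32_ref(a, out, n);
    for (unsigned i = 0; i < n; ++i) {
        if (out[i] != 0)
            return 1;
    }
    return 0;
}
```

The quadratic loop is fine for 8 or 16 lanes and gives a ground truth to compare a SIMD version against.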

Fallback implementation for conflict detection in AVX2

你说的曾经没有我的故事 posted on 2019-11-30 23:20:04
Question: AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a). It returns a vector in which, for every element of a, a bit is set for each preceding element that holds the same value. Is there a way to do something similar in AVX2? I'm not interested in the exact bits; I just need to know which elements are duplicates of the elements to their left (or right). I simply need to know if a scatter would conflict. Basically I need an AVX2 equivalent for __m256i detect_conflict(__m256i a) { __m256i cd = _mm256_conflict_epi32

Horizontal add with __m512 (AVX512)

泪湿孤枕 posted on 2019-11-30 17:23:33
Question: How does one efficiently perform horizontal addition with floats in a 512-bit AVX register (i.e., add the items from a single vector together)? For 128- and 256-bit registers this can be done using _mm_hadd_ps and _mm256_hadd_ps, but there is no _mm512_hadd_ps. The Intel intrinsics guide documents _mm512_reduce_add_ps. It doesn't actually correspond to a single instruction, but its existence suggests there is an optimal method; however, it doesn't appear to be defined in the header files that come with
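A common shape for such a reduction is a halving tree: add the upper half of the lanes onto the lower half, then repeat on the narrower result (16 to 8 to 4 to 2 to 1). The following is a scalar sketch of that pattern over 16 floats, assuming nothing about how any particular compiler implements _mm512_reduce_add_ps; reduce_add_16 is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Sum 16 floats with a halving tree, mirroring the lane pattern a
   512-bit reduction would use (scalar model, hypothetical helper). */
static inline float reduce_add_16(const float *v)
{
    float lane[16];
    assert(v != NULL);
    for (unsigned i = 0; i < 16; ++i)
        lane[i] = v[i];
    /* Each pass folds the upper half onto the lower half. */
    for (unsigned width = 8; width >= 1; width /= 2) {
        for (unsigned i = 0; i < width; ++i)
            lane[i] += lane[i + width];
    }
    return lane[0];
}
```

Because floating-point addition is not associative, a tree like this can round differently from a left-to-right loop over the same values.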