simd

How to use the _mm_extract_epi8 function? [duplicate]

核能气质少年 submitted on 2020-01-06 04:43:06
Question: (This question already has an answer here: "_mm_extract_epi8(…) intrinsic that takes a non-literal integer as argument" (1 answer). Closed 11 months ago.)

I am using the _mm_extract_epi8(__m128i a, const int imm8) function, which has a const int parameter. When I compile this C++ code, I get the following error message: "Error C2057: expected constant expression".

    __m128i a;
    for (int i = 0; i < 16; i++) {
        _mm_extract_epi8(a, i); // compilation error
    }

How can I use this function in a loop?

Answer 1: First of
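The imm8 argument must be a compile-time constant because PEXTRB encodes the lane index as an immediate in the instruction itself. A minimal sketch of one common workaround when the index is genuinely only known at runtime (the helper name extract_byte is my own): spill the vector to memory and index it there.

    #include <immintrin.h>
    #include <cstdint>

    // Hypothetical helper: store the vector to a 16-byte buffer and
    // index it at runtime, since the intrinsic's imm8 must be a
    // compile-time constant.
    static inline std::uint8_t extract_byte(__m128i v, int i) {
        alignas(16) std::uint8_t buf[16];
        _mm_store_si128(reinterpret_cast<__m128i*>(buf), v);
        return buf[i];
    }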

Apache arrow, alignment and padding

不羁岁月 submitted on 2020-01-06 03:15:06
Question: I want to use Apache Arrow because it enables execution engines to take advantage of the latest SIMD (single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing (https://arrow.apache.org/). From the documentation (https://arrow.apache.org/docs/memory_layout.html), I understand that memory allocations are guaranteed to be 64-byte aligned. In order to verify this 64-byte alignment, I use the __array_interface__ data member
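For illustration, a minimal C++ sketch of the alignment check being described (the function name is my own; the pointer stands in for the buffer address that __array_interface__ reports on the Python side):

    #include <cstdint>

    // Returns true if the buffer address is 64-byte aligned, which is
    // what Arrow's allocator is documented to guarantee.
    bool is_64_byte_aligned(const void* data) {
        return reinterpret_cast<std::uintptr_t>(data) % 64 == 0;
    }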

What is the difference between _mm512_load_epi32 and _mm512_load_si512?

泪湿孤枕 submitted on 2020-01-03 18:18:19
Question: The Intel intrinsics guide states simply that _mm512_load_epi32 "load[s] 512-bits (composed of 16 packed 32-bit integers) from memory into dst" and that _mm512_load_si512 "load[s] 512-bits of integer data from memory into dst". What is the difference between these two? The documentation isn't clear.

Answer 1: There's no difference, it's just silly redundant naming. Use _mm512_load_si512 for clarity. Thanks, Intel. As usual, it's easier to understand the underlying asm for AVX-512, and then you can
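A minimal sketch showing the two intrinsics used interchangeably (variable and function names are my own); both compile to the same 64-byte aligned load:

    #include <immintrin.h>
    #include <cstdint>

    void demo() {
        alignas(64) std::int32_t values[16] = {};
        // Both load the same 512 bits from 64-byte-aligned memory;
        // only the names differ.
        __m512i a = _mm512_load_epi32(values);
        __m512i b = _mm512_load_si512(values);
        (void)a; (void)b;
    }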

What is an efficient way to load an x64 ymm register with 4 separated doubles?

瘦欲@ submitted on 2020-01-03 02:54:59
Question: What is the most efficient way to load an x64 ymm register with 4 evenly spaced doubles, i.e. from a contiguous set of doubles 0 1 2 3 4 5 6 7 8 9 10 .. 100, loading for example elements 0, 10, 20, 30? And also with 4 doubles at arbitrary positions, i.e. loading for example elements 1, 6, 22, 43?

Answer 1: The simplest approach is VGATHERQPD, which is an AVX2 instruction available on Haswell and up.

    VGATHERQPD ymm1, [rsi+ymm7*8], ymm2

Using qword indices specified in vm64y, gather double-precision FP values from memory
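A minimal sketch of the same gather via the matching AVX2 intrinsic, using the arbitrary indices from the question (function and variable names are my own):

    #include <immintrin.h>

    // Gathers base[1], base[6], base[22], base[43] into one ymm register.
    // Requires AVX2 (Haswell and up); scale = 8 bytes per double.
    __m256d gather4(const double* base) {
        __m256i idx = _mm256_setr_epi64x(1, 6, 22, 43);
        return _mm256_i64gather_pd(base, idx, 8);
    }

Worth noting as a design point: gathers are convenient but not always the fastest option; on early AVX2 hardware in particular, four scalar loads plus shuffles can be competitive.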

Explaining the different types in Metal and SIMD

柔情痞子 submitted on 2020-01-02 05:22:08
Question: When working with Metal, I find there's a bewildering number of types, and it's not always clear to me which type I should be using in which context. In Apple's Metal Shading Language Specification, there's a pretty clear table of which types are supported within a Metal shader file. However, there's plenty of sample code available that seems to use additional types that are part of SIMD. On the macOS (Objective-C) side of things, the Metal types are not available but the SIMD ones are, and I'm
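For illustration, a minimal sketch of the usual pattern with the <simd/simd.h> types: a host-side struct whose members mirror the shader's float4/float4x4 types, so one layout can be shared between CPU code and a Metal shader (the struct and member names are my own):

    #include <simd/simd.h>

    // Host-side struct whose members correspond to the Metal shader
    // types float4x4 and float4, giving matching memory layouts on
    // both sides of the CPU/GPU boundary.
    struct Uniforms {
        simd_float4x4 modelViewProjection;
        simd_float4   tintColor;
    };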

GCC couldn't vectorize 64-bit multiplication. Can 64-bit x 64-bit -> 128-bit widening multiplication be vectorized on AVX2?

感情迁移 submitted on 2020-01-02 05:17:28
Question: I am trying to vectorize a CBRNG which uses 64-bit widening multiplication.

    static __inline__ uint64_t mulhilo64(uint64_t a, uint64_t b, uint64_t* hip) {
        __uint128_t product = ((__uint128_t)a) * ((__uint128_t)b);
        *hip = product >> 64;
        return (uint64_t)product;
    }

Does such a multiplication exist in a vectorized form in AVX2?

Answer 1: No. There's no 64 x 64 -> 128-bit arithmetic as a vector instruction. Nor is there a vector mulhi-type instruction (high-half result of a multiply). [V]PMULUDQ can do 32 x 32 -> 64
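A minimal sketch of the part that can still be vectorized: the low 64 bits of the product, assembled on AVX2 from the 32 x 32 -> 64 partial products that [V]PMULUDQ (_mm256_mul_epu32) provides (function name is my own; the high half that mulhilo64 also needs would take considerably more work):

    #include <immintrin.h>

    // Low 64 bits of a 64x64 multiply, per 64-bit lane:
    //   a*b mod 2^64 = a_lo*b_lo + ((a_hi*b_lo + a_lo*b_hi) << 32)
    static inline __m256i mul64_lo(__m256i a, __m256i b) {
        __m256i a_hi = _mm256_srli_epi64(a, 32);   // high halves of a
        __m256i b_hi = _mm256_srli_epi64(b, 32);   // high halves of b
        __m256i lo   = _mm256_mul_epu32(a, b);     // a_lo * b_lo
        __m256i m1   = _mm256_mul_epu32(a_hi, b);  // a_hi * b_lo
        __m256i m2   = _mm256_mul_epu32(a, b_hi);  // a_lo * b_hi
        __m256i mid  = _mm256_add_epi64(m1, m2);   // cross terms
        return _mm256_add_epi64(lo, _mm256_slli_epi64(mid, 32));
    }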

Extract set byte positions from a SIMD vector

。_饼干妹妹 submitted on 2020-01-02 04:49:06
Question: I run a batch of computations using SIMD instructions. These instructions return a vector of 16 bytes as result, named compare, with each byte being 0x00 or 0xff:

    index:     0    1    2    3    4    5    6    7   ...  14   15
    compare:  0x00 0x00 0x00 0x00 0xff 0x00 0x00 0x00 ... 0x00 0xff

Bytes set to 0xff mean I need to run the function do_operation(i), with i being the position of the byte. For instance, the above compare vector means I need to run this sequence of operations:

    do_operation(4);
    do_operation(15);

Here is the
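A minimal sketch of the usual approach: compress the 16 comparison bytes to a bitmask with PMOVMSKB, then scan the set bits (do_operation is the question's callback; __builtin_ctz assumes GCC or Clang):

    #include <immintrin.h>

    void do_operation(int i); // the question's callback

    void process(__m128i compare) {
        // One mask bit per byte: bit i is set iff byte i is 0xff.
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(compare));
        while (mask) {
            int i = __builtin_ctz(mask); // index of lowest set bit
            do_operation(i);
            mask &= mask - 1;            // clear that bit
        }
    }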

SSE2: How To Load Data From Non-Contiguous Memory Locations?

▼魔方 西西 submitted on 2020-01-02 03:26:10
Question: I'm trying to vectorize some extremely performance-critical code. At a high level, each loop iteration reads six floats from non-contiguous positions in a small array, then converts these values to double precision and adds them to six different double-precision accumulators. These accumulators are the same across iterations, so they can live in registers. Due to the nature of the algorithm, it's not feasible to make the memory access pattern contiguous. The array is small enough to fit in L1
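A minimal sketch of one SSE2-only building block for this: load two non-adjacent floats, widen them to double, and add to an accumulator (the function name and index parameters are my own):

    #include <emmintrin.h>

    __m128d accumulate_pair(__m128d acc, const float* arr, int i0, int i1) {
        __m128 lo   = _mm_load_ss(arr + i0);        // arr[i0] in lane 0
        __m128 hi   = _mm_load_ss(arr + i1);        // arr[i1] in lane 0
        __m128 pair = _mm_unpacklo_ps(lo, hi);      // [arr[i0], arr[i1], 0, 0]
        return _mm_add_pd(acc, _mm_cvtps_pd(pair)); // widen and accumulate
    }

Three such pairs cover the six floats per iteration, feeding three double-precision accumulator registers with two lanes each.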

How to clear the upper 128 bits of __m256 value?

旧街凉风 submitted on 2020-01-02 01:07:12
Question: How can I clear the upper 128 bits of m2?

    __m256i m2 = _mm256_set1_epi32(2);
    __m128i m1 = _mm_set1_epi32(1);

    m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
    m2 = _mm256_castsi128_si256(m1);

These don't work: Intel's documentation for the _mm256_castsi128_si256 intrinsic says that "the upper bits of the resulting vector are undefined". At the same time, I can easily do it in assembly:

    VMOVDQA xmm2, xmm2 // zeros upper ymm2
    VMOVDQA xmm2, xmm1

Of course I'd not like to use "and" or _mm256
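A minimal sketch using the zero-extending cast, which, unlike _mm256_castsi128_si256, guarantees the upper 128 bits come out zero (assumption: _mm256_zextsi128_si256 is only available in reasonably recent compilers, as it was added to the intrinsics headers well after AVX itself):

    #include <immintrin.h>

    // Zero-extend the low 128 bits of m2 into a __m256i whose upper
    // 128 bits are guaranteed to be zero.
    __m256i clear_upper(__m256i m2) {
        return _mm256_zextsi128_si256(_mm256_castsi256_si128(m2));
    }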