avx512

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

浪子不回头ぞ submitted on 2021-02-20 18:42:04
Question: I am looking for an efficient AVX (AVX512) implementation of

    // Given
    float u[8];
    float v[8];
    // Compute
    float a[8];
    float b[8];
    // Such that
    for ( int i = 0; i < 8; ++i )
    {
        a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
        b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i];
    }

I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.

Answer 1: I had this exact same problem just the other day. The solution I came up with
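The answer excerpt is cut off, so the following is only a minimal sketch of one possible approach, not the answerer's solution. It assumes plain AVX and uses a hypothetical helper name, select_by_abs. It clears the sign bits to get the absolute values, compares them, and blends with the resulting mask. Note that b is taken with !mask, as the question describes, so it differs from the scalar loop when an input is NaN.

```c
#include <immintrin.h>

/* Hypothetical helper: for each of the 8 lanes, a receives whichever of
   u/v has the larger magnitude (u on ties) and b receives the other one. */
__attribute__((target("avx")))
void select_by_abs(const float *u, const float *v, float *a, float *b)
{
    const __m256 sign = _mm256_set1_ps(-0.0f);            /* only the sign bit set */
    __m256 vu = _mm256_loadu_ps(u);
    __m256 vv = _mm256_loadu_ps(v);
    __m256 au = _mm256_andnot_ps(sign, vu);               /* |u| */
    __m256 av = _mm256_andnot_ps(sign, vv);               /* |v| */
    __m256 mask = _mm256_cmp_ps(au, av, _CMP_GE_OQ);      /* |u| >= |v| */
    _mm256_storeu_ps(a, _mm256_blendv_ps(vv, vu, mask));  /* mask ? u : v */
    _mm256_storeu_ps(b, _mm256_blendv_ps(vu, vv, mask));  /* mask ? v : u */
}
```

With AVX-512VL available, the same idea could presumably be expressed with a k-mask compare (_mm256_cmp_ps_mask) and _mm256_mask_blend_ps instead of a vector mask.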

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

我的梦境 submitted on 2021-02-20 18:40:50
Question: I am looking for an efficient AVX (AVX512) implementation of

    // Given
    float u[8];
    float v[8];
    // Compute
    float a[8];
    float b[8];
    // Such that
    for ( int i = 0; i < 8; ++i )
    {
        a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
        b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i];
    }

I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.

Answer 1: I had this exact same problem just the other day. The solution I came up with

Gather AVX2&512 intrinsic for 16-bit integers?

五迷三道 submitted on 2021-02-11 12:15:20
Question: Imagine this piece of code:

    void Function(int16 *src, int *indices, float *dst, int cnt, float mul)
    {
        for (int i=0; i<cnt; i++)
            dst[i] = float(src[indices[i]]) * mul;
    }

This really asks for gather intrinsics, e.g. _mm_i32gather_epi32. I have had great success with these when loading floats, but are there any for 16-bit ints? Another problem here is that I need to transition from 16 bits on the input to 32 bits (float) on the output.

Answer 1: There is indeed no instruction to gather 16-bit integers, but
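The answer is truncated, so what follows is only a sketch of one common workaround, under stated assumptions: gather 32 bits at each 16-bit index with a byte scale of 2, then sign-extend the low half. The function name gather_i16_to_float is hypothetical, the code assumes AVX2 on a little-endian x86 target, and it assumes src has at least 2 readable bytes after its last element, because every 32-bit gather also reads the following 16-bit value.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical AVX2 variant of Function() from the question.
   ASSUMPTION: src has at least 2 readable bytes past its last element,
   since each 32-bit gather reads src[idx] together with src[idx + 1]. */
__attribute__((target("avx2")))
void gather_i16_to_float(const int16_t *src, const int *indices,
                         float *dst, int cnt, float mul)
{
    const __m256 vmul = _mm256_set1_ps(mul);
    int i = 0;
    for (; i + 8 <= cnt; i += 8) {
        __m256i idx = _mm256_loadu_si256((const __m256i *)(indices + i));
        /* scale = 2: the indices count 16-bit elements, not 32-bit ones */
        __m256i g = _mm256_i32gather_epi32((const int *)src, idx, 2);
        /* the wanted value sits in the low 16 bits: sign-extend it */
        g = _mm256_srai_epi32(_mm256_slli_epi32(g, 16), 16);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(_mm256_cvtepi32_ps(g), vmul));
    }
    for (; i < cnt; ++i)                  /* scalar tail */
        dst[i] = (float)src[indices[i]] * mul;
}
```

The shift pair handles both the 16-to-32-bit widening and the removal of the neighbouring element that the gather pulled in; whether this beats a scalar loop depends on the CPU and should be measured.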

Truth-table reduction to ternary logic operations, vpternlog

对着背影说爱祢 submitted on 2021-02-06 10:52:12
Question: I have many truth tables of many variables (7 or more), and I use a tool (e.g. Logic Friday 1) to simplify the logic formulas. I could do that by hand, but that is much too error-prone. I then translate these formulas to compiler intrinsics (e.g. _mm_xor_epi32), which works fine. Question: with vpternlog I can perform ternary logic operations, but I'm not aware of a method to simplify my truth tables to sequences of vpternlog instructions that are (somewhat) efficient. I'm not asking if someone knows a
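The question concerns reducing large truth tables, which this sketch does not attempt. It only illustrates the building block: how the 8-bit immediate of a single vpternlog encodes a 3-input function, with the first operand supplying the most significant index bit. All names here (bool3_fn, ternlog_imm8, ternlog32, xor3, bitselect) are hypothetical, and ternlog32 is a scalar model of one 32-bit lane rather than the instruction itself.

```c
#include <stdint.h>

/* Signature of a hypothetical 3-input boolean function (inputs are 0 or 1). */
typedef uint8_t (*bool3_fn)(uint8_t a, uint8_t b, uint8_t c);

/* Build the vpternlog immediate: bit (a<<2 | b<<1 | c) holds f(a, b, c). */
uint8_t ternlog_imm8(bool3_fn f)
{
    uint8_t imm = 0;
    for (unsigned i = 0; i < 8; ++i) {
        uint8_t a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
        imm |= (uint8_t)((f(a, b, c) & 1u) << i);
    }
    return imm;
}

/* Scalar model of vpternlogd on one 32-bit lane, for checking immediates. */
uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
    uint32_t r = 0;
    for (unsigned bit = 0; bit < 32; ++bit) {
        unsigned i = (((a >> bit) & 1u) << 2)
                   | (((b >> bit) & 1u) << 1)
                   |  ((c >> bit) & 1u);
        r |= (uint32_t)((imm >> i) & 1u) << bit;
    }
    return r;
}

/* Example functions: three-way XOR and bitwise select (a ? b : c). */
uint8_t xor3(uint8_t a, uint8_t b, uint8_t c) { return a ^ b ^ c; }
uint8_t bitselect(uint8_t a, uint8_t b, uint8_t c) { return a ? b : c; }
```

An immediate computed this way could then be passed to an intrinsic such as _mm512_ternarylogic_epi32(a, b, c, imm8); any 3-variable sub-expression found by a logic minimizer maps to exactly one such instruction.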

AVX 512 vs AVX2 performance for simple array processing loops [closed]

假装没事ソ submitted on 2020-05-13 14:49:05
Question: I'm currently working on some optimizations and comparing vectorization possibilities for DSP applications that seem ideal for AVX512, since these are just simple uncorrelated array-processing loops. But on a new i9 I didn't measure any reasonable improvements when using AVX512 compared to AVX2. Any

Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?

狂风中的少年 submitted on 2020-05-09 02:27:56
Question: Writing a ZMM register can leave a Skylake-X (or similar) CPU in a state of reduced max turbo indefinitely. (See "SIMD instructions lowering CPU frequency" and "Dynamically determining where a rogue AVX-512 instruction is executing".) Presumably Ice Lake is similar. (Workaround: not a problem for zmm16..31, according to @BeeOnRope's comments, which I quoted in "Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?" So this strlen could just use vpxord xmm16,xmm16,xmm16

How to load a avx-512 zmm register from a ioremap() address?

跟風遠走 submitted on 2020-04-16 02:58:10
Question: My goal is to create a PCIe transaction with more than 64b payload. For that I need to read an ioremap() address. For 128b and 256b I can use xmm and ymm registers respectively, and that works as expected. Now I'd like to do the same for 512b zmm registers (memory-like storage?!). Some code, under a license that does not allow me to show it here, uses assembly for 256b:

    void __iomem *addr;
    uint8_t datareg[32];
    [...]
    // Read memory address to ymm (to have 256b at once):
    asm volatile("vmovdqa %0,%%ymm1" :