avx | 易学教程

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

Question: I am looking for an efficient AVX (or AVX-512) implementation of the following:

    // Given
    float u[8];
    float v[8];

    // Compute
    float a[8];
    float b[8];

    // Such that
    for ( int i = 0; i < 8; ++i )
    {
        a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
        b[i] = fabs(u[i]) <  fabs(v[i]) ? u[i] : v[i];
    }

That is, I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.

Answer 1: I had this exact same problem just the other day. The solution I came up with ...
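
The selection described in the question can be sketched with a compare and two blends. This is only an illustration under stated assumptions, not the answerer's solution (which is cut off above): the function name select_by_abs is hypothetical, the target attribute assumes GCC or Clang, and the code follows the "mask / !mask" reading of the question, so a lane containing NaN sends v to a and u to b, which differs from the scalar loop.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper (the name is not from the original post).
   For each of the 8 lanes:
     a[i] = |u[i]| >= |v[i]| ? u[i] : v[i]
     b[i] = the element that a[i] did not take.
   The target attribute lets this compile without -mavx (GCC/Clang only). */
__attribute__((target("avx")))
void select_by_abs(const float *u, const float *v, float *a, float *b)
{
    assert(u && v && a && b);

    const __m256 sign = _mm256_set1_ps(-0.0f);   /* only the sign bit set */
    __m256 vu = _mm256_loadu_ps(u);
    __m256 vv = _mm256_loadu_ps(v);

    /* Clear the sign bit to get absolute values. */
    __m256 abs_u = _mm256_andnot_ps(sign, vu);
    __m256 abs_v = _mm256_andnot_ps(sign, vv);

    /* All-ones in lanes where |u| >= |v| (false if either is NaN). */
    __m256 mask = _mm256_cmp_ps(abs_u, abs_v, _CMP_GE_OQ);

    /* blendv picks its second operand where the mask's sign bit is set. */
    _mm256_storeu_ps(a, _mm256_blendv_ps(vv, vu, mask));
    _mm256_storeu_ps(b, _mm256_blendv_ps(vu, vv, mask));
}
```

With AVX-512 the same idea would presumably use a compare that returns a mask register followed by mask blends, but that variant is not shown here.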

Fast interleave 2 double arrays into an array of structs with 2 float and 1 int (loop invariant) member, with SIMD double->float conversion?

Question: I have a section of code that is a bottleneck in a C++ application running on x86 processors: it takes double values from two arrays, casts them to float, and stores them in an array of structs. It is a bottleneck because it is called either with very large loop counts or thousands of times. Is there a faster way to do this copy-and-cast operation using SIMD intrinsics? I have seen this answer on faster memcpy, but it doesn't address the cast. The simple C++ loop case looks like this: int _iNum; const ...
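
The original loop is cut off, so the following is only a sketch of the general technique the title describes: convert four doubles at a time to float, interleave the two streams, and store each pair into a struct with two float members and one loop-invariant int. The struct Point, its field names, and the function pack_points are all assumptions made for illustration; the target attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical layout: the title only says "2 float and 1 int member". */
typedef struct {
    float x;
    float y;
    int   tag;   /* loop-invariant value */
} Point;

/* Hypothetical helper: out[i] = { (float)xs[i], (float)ys[i], tag }. */
__attribute__((target("avx")))
void pack_points(const double *xs, const double *ys, int tag,
                 Point *out, size_t n)
{
    assert(xs && ys && out);
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Narrow 4 doubles from each array to 4 floats. */
        __m128 fx = _mm256_cvtpd_ps(_mm256_loadu_pd(xs + i));
        __m128 fy = _mm256_cvtpd_ps(_mm256_loadu_pd(ys + i));

        __m128 lo = _mm_unpacklo_ps(fx, fy);   /* x0 y0 x1 y1 */
        __m128 hi = _mm_unpackhi_ps(fx, fy);   /* x2 y2 x3 y3 */

        /* Each 64-bit store writes one (x, y) pair; the int is written separately. */
        _mm_storel_pi((__m64 *)&out[i + 0].x, lo);
        _mm_storeh_pi((__m64 *)&out[i + 1].x, lo);
        _mm_storel_pi((__m64 *)&out[i + 2].x, hi);
        _mm_storeh_pi((__m64 *)&out[i + 3].x, hi);

        out[i + 0].tag = tag;
        out[i + 1].tag = tag;
        out[i + 2].tag = tag;
        out[i + 3].tag = tag;
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i) {
        out[i].x = (float)xs[i];
        out[i].y = (float)ys[i];
        out[i].tag = tag;
    }
}
```

Whether this beats the compiler's own vectorization of the scalar loop depends on the real struct layout and on memory bandwidth, so it would need to be measured.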

find nan in array of doubles using simd

Question: This question is very similar to "SIMD instructions for floating point equality comparison (with NaN == NaN)", although that question focused on 128-bit vectors and had requirements about identifying +0 and -0. I had a feeling I might be able to work this one out myself, but the Intel Intrinsics Guide page seems to be down. My goal is to take an array of doubles and return whether a NaN is present in the array. I expect that most of the time there won't be one, and would like ...
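
One common way to test for NaN with AVX is to compare each vector with itself using the "unordered" predicate, which is true only in lanes holding NaN. The sketch below assumes GCC or Clang (for the target attribute) and uses a hypothetical function name, has_nan. Because the question expects NaNs to be rare, it ORs the comparison results together and checks once at the end rather than branching in every iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Hypothetical helper: returns 1 if any of the n doubles is NaN, else 0. */
__attribute__((target("avx")))
int has_nan(const double *p, size_t n)
{
    assert(p || n == 0);
    size_t i = 0;
    __m256d any = _mm256_setzero_pd();

    for (; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(p + i);
        /* v unordered with itself <=> the lane is NaN. */
        any = _mm256_or_pd(any, _mm256_cmp_pd(v, v, _CMP_UNORD_Q));
    }
    if (_mm256_movemask_pd(any) != 0)
        return 1;

    /* Scalar tail; x != x holds only for NaN (do not build with -ffast-math). */
    for (; i < n; ++i)
        if (p[i] != p[i])
            return 1;
    return 0;
}
```

For very large arrays it may be worth checking the accumulated mask every few hundred elements so that an early NaN ends the scan sooner.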

How to extract 8 integers from a 256 vector using intel intrinsics?

Question: I'm trying to improve the performance of my code by using 256-bit vectors (Intel intrinsics, AVX). I have a 4th-generation Core i7 (Haswell architecture) processor supporting SSE through SSE4.2 and the AVX/AVX2 extensions. This is the code snippet that I'm trying to speed up:

    /* code snippet */
    kfac1 = kfac + factor; /* 7 cycles for 7 additions */
    kfac2 = kfac1 + factor;
    kfac3 = kfac2 + factor;
    kfac4 = kfac3 + factor;
    kfac5 = kfac4 + factor;
    kfac6 = kfac5 + factor;
    kfac7 = kfac6 + factor;
    k1fac1 = k1fac + ...
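
A minimal sketch of the pattern in the snippet: the chain of dependent additions can be replaced by one vector computation of base + k * step, after which the eight results are extracted by storing the vector to a plain array. This assumes the values are 32-bit ints (the excerpt does not show their type), uses a hypothetical function name ramp8, and relies on AVX2 for the integer arithmetic, which Haswell provides.

```c
#include <assert.h>
#include <immintrin.h>

/* Hypothetical helper: out[k] = base + (k + 1) * step for k = 0..7,
   i.e. the analogue of kfac1..kfac7 plus one extra lane. */
__attribute__((target("avx2")))
void ramp8(int base, int step, int *out)
{
    assert(out);
    const __m256i multipliers = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);

    __m256i v = _mm256_add_epi32(
        _mm256_set1_epi32(base),
        _mm256_mullo_epi32(multipliers, _mm256_set1_epi32(step)));

    /* Extract all 8 integers at once with an unaligned store. */
    _mm256_storeu_si256((__m256i *)out, v);
}
```

Storing to memory is usually the simplest way to get every lane out. _mm256_extract_epi32 can read a single lane instead, but its index must be a compile-time constant.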

AVX 4-bit integers

Question: I need to perform the following operation:

    w[i] = scale * v[i] + point

Here scale and point are fixed, whereas v[] is a vector of 4-bit integers. I need to compute w[] for an arbitrary input vector v[], and I want to speed up the process using AVX intrinsics. The question is how to perform operations on 4-bit integers using intrinsics. I could use 8-bit integers and perform the operations that way, but is there a way to do the following: [a,b] + [c,d] = [a ...
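
x86 SIMD has no arithmetic on 4-bit lanes, so the usual approach is to unpack the nibbles into wider lanes first. The sketch below makes several assumptions that the excerpt does not confirm: the values are unsigned, two are packed per byte with the even-indexed element in the low nibble, and scale, point and the results fit in 16-bit integers. The function name scale_nibbles is hypothetical. It uses 128-bit SSE2 operations to keep element order simple; the same mask-and-shift idea carries over to 256-bit AVX2 registers, where the unpack instructions work within each 128-bit half and the order needs extra care.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>

/* Hypothetical helper: packed holds 32 unsigned 4-bit values in 16 bytes,
   v[2j] in the low nibble and v[2j+1] in the high nibble of byte j.
   Writes w[i] = scale * v[i] + point for i = 0..31 (16-bit wraparound). */
void scale_nibbles(const uint8_t *packed, int16_t scale, int16_t point,
                   int16_t *w)
{
    assert(packed && w);
    const __m128i low_mask = _mm_set1_epi8(0x0F);
    const __m128i zero = _mm_setzero_si128();
    const __m128i s = _mm_set1_epi16(scale);
    const __m128i p = _mm_set1_epi16(point);

    __m128i bytes = _mm_loadu_si128((const __m128i *)packed);
    __m128i even = _mm_and_si128(bytes, low_mask);                    /* v[0], v[2], ... */
    __m128i odd  = _mm_and_si128(_mm_srli_epi16(bytes, 4), low_mask); /* v[1], v[3], ... */

    /* Interleave back into original order, one value per byte. */
    __m128i v_lo = _mm_unpacklo_epi8(even, odd);   /* v[0..15]  */
    __m128i v_hi = _mm_unpackhi_epi8(even, odd);   /* v[16..31] */

    /* Zero-extend bytes to 16 bits, 8 values per vector. */
    __m128i parts[4] = {
        _mm_unpacklo_epi8(v_lo, zero), _mm_unpackhi_epi8(v_lo, zero),
        _mm_unpacklo_epi8(v_hi, zero), _mm_unpackhi_epi8(v_hi, zero)
    };

    for (int k = 0; k < 4; ++k) {
        __m128i r = _mm_add_epi16(_mm_mullo_epi16(parts[k], s), p);
        _mm_storeu_si128((__m128i *)(w + 8 * k), r);
    }
}
```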

Ubuntu - how to tell if AVX or SSE is currently being used by a CPU app?

Question: I currently run BOINC across a number of servers that have GPUs. The servers run both GPU and CPU BOINC apps. Because AVX and SSE lower the CPU frequency when they are used within a CPU app, I have to be selective about which CPU and GPU apps I run together: some GPU apps get bottlenecked (slower run-time completion), whereas others do not. At present some CPU apps are named so that it is clear whether they use AVX, but most are not. Is there any command I can run, and some way of viewing, to see if any of ...

Summing 8-bit integers in __m512i with AVX intrinsics

Question: AVX-512 provides intrinsics to sum all elements of a 512-bit vector. However, some of their counterparts are missing: there is no _mm512_reduce_add_epi8 yet.

    _mm512_reduce_add_ps     // horizontal sum of 16 floats
    _mm512_reduce_add_pd     // horizontal sum of 8 doubles
    _mm512_reduce_add_epi32  // horizontal sum of 16 32-bit integers
    _mm512_reduce_add_epi64  // horizontal sum of 8 64-bit integers

Basically, I need to implement MAGIC in the following snippet.

    __m512i all_ones = _mm512_set1_epi16(1); ...
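
One way to fill the gap, sketched here for unsigned bytes only, is to take the sum of absolute differences against zero, which yields eight 64-bit partial sums, and then reduce those with _mm512_reduce_add_epi64. This is not necessarily the MAGIC the asker had in mind, since the snippet is cut off. The function names are hypothetical, the target attribute and __builtin_cpu_supports assume GCC or Clang, and the scalar fallback is there only so the code still runs on CPUs without AVX-512BW.

```c
#include <assert.h>
#include <stdint.h>
#include <immintrin.h>

/* AVX-512BW path: sum of 64 unsigned bytes. */
static __attribute__((target("avx512bw")))
uint64_t sum_u8x64_avx512(const uint8_t *p)
{
    __m512i v = _mm512_loadu_si512((const void *)p);
    /* |byte - 0| summed per group of 8 bytes -> eight 64-bit partial sums. */
    __m512i partial = _mm512_sad_epu8(v, _mm512_setzero_si512());
    return (uint64_t)_mm512_reduce_add_epi64(partial);
}

/* Hypothetical entry point: picks the vector path when the CPU supports it. */
uint64_t sum_u8x64(const uint8_t *p)
{
    assert(p);
    if (__builtin_cpu_supports("avx512bw"))
        return sum_u8x64_avx512(p);

    uint64_t total = 0;            /* portable fallback */
    for (int i = 0; i < 64; ++i)
        total += p[i];
    return total;
}
```

For signed bytes, one option is to flip the top bit of every byte before the sum and subtract 64 * 128 from the result, since flipping that bit adds 128 to each value when it is read as unsigned.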
