avx

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

浪子不回头ぞ 提交于 2021-02-20 18:42:04
问题 I am looking for efficient AVX (AVX512) implementation of // Given float u[8]; float v[8]; // Compute float a[8]; float b[8]; // Such that for ( int i = 0; i < 8; ++i ) { a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i]; b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i]; } I.e., I need to select element-wise into a from u and v based on mask , and into b based on !mask , where mask = (fabs(u) >= fabs(v)) element-wise. 回答1: I had this exact same problem just the other day. The solution I came up with

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

我的梦境 提交于 2021-02-20 18:40:50
问题 I am looking for efficient AVX (AVX512) implementation of // Given float u[8]; float v[8]; // Compute float a[8]; float b[8]; // Such that for ( int i = 0; i < 8; ++i ) { a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i]; b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i]; } I.e., I need to select element-wise into a from u and v based on mask , and into b based on !mask , where mask = (fabs(u) >= fabs(v)) element-wise. 回答1: I had this exact same problem just the other day. The solution I came up with

Fast interleave 2 double arrays into an array of structs with 2 float and 1 int (loop invariant) member, with SIMD double->float conversion?

試著忘記壹切 提交于 2021-02-20 06:50:27
问题 I have a section of code which is a bottleneck in a C++ application running on x86 processors, where we take double values from two arrays, cast to float and store in an array of structs. The reason this is a bottleneck is it is called either with very large loops, or thousands of times. Is there a faster way to do this copy & cast operation using SIMD Intrinsics? I have seen this answer on faster memcpy but doesn't address the cast. The simple C++ loop case looks like this int _iNum; const

find nan in array of doubles using simd

試著忘記壹切 提交于 2021-02-19 02:18:03
问题 This question is very similar to: SIMD instructions for floating point equality comparison (with NaN == NaN) Although that question focused on 128 bit vectors and had requirements about identifying +0 and -0. I had a feeling I might be able to get this one myself but the intel intrinsics guide page seems to be down :/ My goal is to take an array of doubles and to return whether a NaN is present in the array. I am expecting that the majority of the time that there won't be one, and would like

How to extract 8 integers from a 256 vector using intel intrinsics?

雨燕双飞 提交于 2021-02-19 02:08:35
问题 I'm trying to enhance the performance of my code by using the 256bit vector (Intel intrinsics - AVX). I have an I7 Gen.4 (Haswell architecture) processor supporting SSE1 to SSE4.2 and AVX/AVX2 Extensions. This is the code snippet that I'm trying to enhance: /* code snipet */ kfac1 = kfac + factor; /* 7 cycles for 7 additions */ kfac2 = kfac1 + factor; kfac3 = kfac2 + factor; kfac4 = kfac3 + factor; kfac5 = kfac4 + factor; kfac6 = kfac5 + factor; kfac7 = kfac6 + factor; k1fac1 = k1fac +

How to extract 8 integers from a 256 vector using intel intrinsics?

旧时模样 提交于 2021-02-19 02:05:56
问题 I'm trying to enhance the performance of my code by using the 256bit vector (Intel intrinsics - AVX). I have an I7 Gen.4 (Haswell architecture) processor supporting SSE1 to SSE4.2 and AVX/AVX2 Extensions. This is the code snippet that I'm trying to enhance: /* code snipet */ kfac1 = kfac + factor; /* 7 cycles for 7 additions */ kfac2 = kfac1 + factor; kfac3 = kfac2 + factor; kfac4 = kfac3 + factor; kfac5 = kfac4 + factor; kfac6 = kfac5 + factor; kfac7 = kfac6 + factor; k1fac1 = k1fac +

AVX 4-bit integers

[亡魂溺海] 提交于 2021-02-18 12:12:32
问题 I need to perform the following operation: w[i] = scale * v[i] + point scale and point are fixed, whereas v[] is a vector of 4-bit integers. I need to compute w[] for the arbitrary input vector v[] and I want to speed up the process using AVX intrinsics. However, v[i] is a vector of 4-bit integers. The question is how to perform operations on 4-bit integers using intrinsics? I could use 8-bit integers and perform operations that way, but is there a way to do the following: [a,b] + [c,d] = [a

Ubuntu - how to tell if AVX or SSE, is current being used by CPU app?

与世无争的帅哥 提交于 2021-02-16 15:42:45
问题 I current run BOINC across a number of servers which have GPUs. The servers run both GPU and CPU BOINC apps. As AVX and SSE slow down the CPU freq when being used within a CPU app, I have to be selective which CPU/GPU I run together, as some GPU apps get bottle necked (slower run time completion) where as others do not. At present some CPU apps are named so it is clear to see if they use AVX but most are not. Therefore is there any command I can run, and some way of viewing, to see if any of

Summing 8-bit integers in __m512i with AVX intrinsics

喜夏-厌秋 提交于 2021-02-15 07:40:34
问题 AVX512 provide us with intrinsics to sum all cells in a __mm512 vector. However, some of their counterparts are missing: there is no _mm512_reduce_add_epi8 , yet. _mm512_reduce_add_ps //horizontal sum of 16 floats _mm512_reduce_add_pd //horizontal sum of 8 doubles _mm512_reduce_add_epi32 //horizontal sum of 16 32-bit integers _mm512_reduce_add_epi64 //horizontal sum of 8 64-bit integers Basically, I need to implement MAGIC in the following snippet. __m512i all_ones = _mm512_set1_epi16(1);

Summing 8-bit integers in __m512i with AVX intrinsics

浪尽此生 提交于 2021-02-15 07:40:20
问题 AVX512 provide us with intrinsics to sum all cells in a __mm512 vector. However, some of their counterparts are missing: there is no _mm512_reduce_add_epi8 , yet. _mm512_reduce_add_ps //horizontal sum of 16 floats _mm512_reduce_add_pd //horizontal sum of 8 doubles _mm512_reduce_add_epi32 //horizontal sum of 16 32-bit integers _mm512_reduce_add_epi64 //horizontal sum of 8 64-bit integers Basically, I need to implement MAGIC in the following snippet. __m512i all_ones = _mm512_set1_epi16(1);