intrinsics

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

Submitted by 浪子不回头ぞ on 2021-02-20 18:42:04
Question: I am looking for an efficient AVX (AVX-512) implementation of:

    // Given
    float u[8];
    float v[8];
    // Compute
    float a[8];
    float b[8];
    // Such that
    for ( int i = 0; i < 8; ++i ) {
        a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
        b[i] = fabs(u[i]) <  fabs(v[i]) ? u[i] : v[i];
    }

I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.

Answer 1: I had this exact same problem just the other day. The solution I came up with…
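One way to build those two selections (a sketch of a standard approach, not necessarily the answer the excerpt above refers to) is to clear the sign bits to get |u| and |v|, compare, and use _mm256_blendv_ps in both directions. The helper name select_by_abs below is illustrative only:

    #include <immintrin.h>

    // a[i] = (|u[i]| >= |v[i]|) ? u[i] : v[i];  b[i] gets the other operand.
    // Sketch only: assumes AVX and the float[8] arrays from the question.
    void select_by_abs(const float *u, const float *v, float *a, float *b)
    {
        __m256 vu   = _mm256_loadu_ps(u);
        __m256 vv   = _mm256_loadu_ps(v);
        __m256 sign = _mm256_set1_ps(-0.0f);        // 0x80000000 in every lane
        __m256 au   = _mm256_andnot_ps(sign, vu);   // |u|
        __m256 av   = _mm256_andnot_ps(sign, vv);   // |v|
        __m256 mask = _mm256_cmp_ps(au, av, _CMP_GE_OQ);
        // blendv takes the second source where the mask is set.
        _mm256_storeu_ps(a, _mm256_blendv_ps(vv, vu, mask)); // mask ? u : v
        _mm256_storeu_ps(b, _mm256_blendv_ps(vu, vv, mask)); // mask ? v : u
    }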

Fast interleave 2 double arrays into an array of structs with 2 float and 1 int (loop invariant) member, with SIMD double->float conversion?

Submitted by 試著忘記壹切 on 2021-02-20 06:50:27
Question: I have a section of code which is a bottleneck in a C++ application running on x86 processors, where we take double values from two arrays, cast them to float, and store them in an array of structs. The reason this is a bottleneck is that it is called either within very large loops or thousands of times. Is there a faster way to do this copy-and-cast operation using SIMD intrinsics? I have seen this answer on faster memcpy, but it doesn't address the cast. The simple C++ loop case looks like this:

    int _iNum;
    const …
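A straightforward split (a sketch under assumed names, not taken from the linked answer) is to let SIMD handle the double-to-float conversion four elements at a time with _mm256_cvtpd_ps and keep the 12-byte struct stores scalar; the struct Packed and the function convert_and_pack are hypothetical stand-ins for the question's actual types:

    #include <immintrin.h>

    // Hypothetical layout matching the description: two floats plus one
    // loop-invariant int per element.
    struct Packed { float a; float b; int id; };

    // Sketch only: assumes AVX and that n is a multiple of 4.
    void convert_and_pack(const double *x, const double *y, int id,
                          Packed *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 xf = _mm256_cvtpd_ps(_mm256_loadu_pd(x + i)); // 4 doubles -> 4 floats
            __m128 yf = _mm256_cvtpd_ps(_mm256_loadu_pd(y + i));
            float xs[4], ys[4];
            _mm_storeu_ps(xs, xf);
            _mm_storeu_ps(ys, yf);
            for (int k = 0; k < 4; ++k) {
                out[i + k].a  = xs[k];
                out[i + k].b  = ys[k];
                out[i + k].id = id;   // loop-invariant member
            }
        }
    }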

How to extract 8 integers from a 256 vector using intel intrinsics?

Submitted by 雨燕双飞 on 2021-02-19 02:08:35
Question: I'm trying to enhance the performance of my code by using 256-bit vectors (Intel intrinsics - AVX). I have an i7 4th-gen (Haswell architecture) processor supporting SSE1 to SSE4.2 and the AVX/AVX2 extensions. This is the code snippet that I'm trying to enhance:

    /* code snippet */
    kfac1 = kfac + factor; /* 7 cycles for 7 additions */
    kfac2 = kfac1 + factor;
    kfac3 = kfac2 + factor;
    kfac4 = kfac3 + factor;
    kfac5 = kfac4 + factor;
    kfac6 = kfac5 + factor;
    kfac7 = kfac6 + factor;
    k1fac1 = k1fac + …
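For the extraction in the title (a sketch of common options, separate from the snippet above), the whole __m256i can be spilled to a small int array with _mm256_storeu_si256, or a single lane can be pulled out with _mm256_extract_epi32 when the index is a compile-time constant:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        __m256i v = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);

        // Option 1: store all 8 lanes and read them back as ints.
        int out[8];
        _mm256_storeu_si256((__m256i *)out, v);
        for (int i = 0; i < 8; ++i)
            printf("%d ", out[i]);
        printf("\n");

        // Option 2: extract one lane directly (index must be a constant).
        printf("lane 3 = %d\n", _mm256_extract_epi32(v, 3));
        return 0;
    }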

AVX 4-bit integers

Submitted by [亡魂溺海] on 2021-02-18 12:12:32
Question: I need to perform the following operation:

    w[i] = scale * v[i] + point

scale and point are fixed, whereas v[] is a vector of 4-bit integers. I need to compute w[] for an arbitrary input vector v[], and I want to speed up the process using AVX intrinsics. However, each v[i] is a 4-bit integer. The question is how to perform operations on 4-bit integers using intrinsics. I could use 8-bit integers and perform operations that way, but is there a way to do the following: [a,b] + [c,d] = [a…
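Since there is no native 4-bit lane arithmetic, a common first step (a sketch under my own assumptions, not the accepted answer) is to unpack each byte's two nibbles into separate 8-bit lanes with a mask and a shift; from there the values can be widened (e.g. with _mm256_cvtepu8_epi32) so that scale * v + point runs on ordinary 16- or 32-bit lanes. The helper name unpack_nibbles is illustrative:

    #include <immintrin.h>

    // Split 32 packed bytes (64 nibbles) into low- and high-nibble halves,
    // one value per byte. Sketch only: assumes AVX2 and that the low nibble
    // of each byte is the first logical element.
    static inline void unpack_nibbles(__m256i packed, __m256i *lo, __m256i *hi)
    {
        const __m256i mask = _mm256_set1_epi8(0x0F);
        *lo = _mm256_and_si256(packed, mask);                        // low nibbles
        *hi = _mm256_and_si256(_mm256_srli_epi16(packed, 4), mask);  // high nibbles
    }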

Summing 8-bit integers in __m512i with AVX intrinsics

Submitted by 喜夏-厌秋 on 2021-02-15 07:40:34
Question: AVX-512 provides intrinsics to sum all lanes of a __m512i vector. However, some of their counterparts are missing: there is no _mm512_reduce_add_epi8 yet.

    _mm512_reduce_add_ps    // horizontal sum of 16 floats
    _mm512_reduce_add_pd    // horizontal sum of 8 doubles
    _mm512_reduce_add_epi32 // horizontal sum of 16 32-bit integers
    _mm512_reduce_add_epi64 // horizontal sum of 8 64-bit integers

Basically, I need to implement MAGIC in the following snippet:

    __m512i all_ones = _mm512_set1_epi16(1);
    …
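For unsigned bytes there is a standard workaround (a sketch, assuming AVX-512BW is available): _mm512_sad_epu8 against zero collapses each group of 8 bytes into a 64-bit partial sum, and the existing 64-bit reduction finishes the job. The helper name hsum_epu8 is mine; signed bytes would need a bias (XOR each byte with 0x80, then subtract 128 * 64 from the total):

    #include <immintrin.h>
    #include <stdint.h>

    // Horizontal sum of the 64 unsigned 8-bit lanes of a __m512i.
    // Sketch only: requires AVX-512BW (sad) and AVX-512F (reduce).
    static inline int64_t hsum_epu8(__m512i v)
    {
        __m512i sums = _mm512_sad_epu8(v, _mm512_setzero_si512()); // 8 partial sums
        return _mm512_reduce_add_epi64(sums);
    }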

Segmentation fault when trying to use intrinsics specifically _mm256_storeu_pd()

Submitted by 无人久伴 on 2021-02-11 15:52:04
Question: I seem to have fixed it myself by casting the cij2 pointer inside the _mm256 call, so

    _mm256_storeu_pd((double *)cij2, vecC);

I have no idea why this changed anything... I'm writing some code and trying to take advantage of manual vectorization with Intel intrinsics. But whenever I run the code I get a segmentation fault when trying to use my double *cij2.

    if (q == 0) {
        __m256d vecA;
        __m256d vecB;
        __m256d vecC;
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) {
                double cij = C[i+j*lda];
                double …
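A cast by itself cannot make an invalid pointer valid, so the usual suspect here is that cij2 was never pointed at 32 bytes of writable memory before the store. A minimal sketch of correct usage, reusing the question's variable names but otherwise my own assumption about the allocation:

    #include <immintrin.h>
    #include <stdlib.h>

    int main(void)
    {
        // _mm256_storeu_pd writes 4 doubles (32 bytes). The unaligned store
        // needs no particular alignment, but the destination must exist.
        double *cij2 = (double *)malloc(4 * sizeof(double));
        if (!cij2) return 1;

        __m256d vecC = _mm256_set1_pd(1.5);
        _mm256_storeu_pd(cij2, vecC);   // safe: cij2 covers 32 bytes

        free(cij2);
        return 0;
    }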
