sse

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

浪子不回头ぞ submitted on 2021-02-20 18:42:04
Question: I am looking for an efficient AVX (AVX512) implementation of

    // Given
    float u[8];
    float v[8];
    // Compute
    float a[8];
    float b[8];
    // Such that
    for ( int i = 0; i < 8; ++i ) {
        a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
        b[i] = fabs(u[i]) < fabs(v[i]) ? u[i] : v[i];
    }

I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.

Answer 1: I had this exact same problem just the other day. The solution I came up with
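The answer text is cut off here, so the following is not the answerer's solution. It is a minimal sketch of the mask-and-select idea the question describes, written 4-wide with plain SSE so it compiles on any x86-64 target. The helper names `select_by_abs_sse` and `select_by_abs4` are hypothetical. An 8-wide AVX version would presumably use the `_mm256_` counterparts of the same operations.

```c
#include <xmmintrin.h>  /* SSE: __m128 and the _mm_*_ps intrinsics */

/* Hypothetical helper (not from the original post): for each lane, *a gets
   the input with the larger absolute value and *b gets the other input. */
static void select_by_abs_sse(__m128 u, __m128 v, __m128 *a, __m128 *b)
{
    /* -0.0f has only the sign bit set; clearing that bit gives fabs(). */
    const __m128 sign_bit = _mm_set1_ps(-0.0f);
    const __m128 abs_u = _mm_andnot_ps(sign_bit, u);
    const __m128 abs_v = _mm_andnot_ps(sign_bit, v);

    /* All-ones in lanes where fabs(u) >= fabs(v), all-zeros elsewhere. */
    const __m128 mask = _mm_cmpge_ps(abs_u, abs_v);

    /* Bitwise select: (mask & x) | (~mask & y). b uses the inverted mask. */
    *a = _mm_or_ps(_mm_and_ps(mask, u), _mm_andnot_ps(mask, v));
    *b = _mm_or_ps(_mm_and_ps(mask, v), _mm_andnot_ps(mask, u));
}

/* Array wrapper for 4 floats, using unaligned loads and stores. */
static void select_by_abs4(const float *u, const float *v, float *a, float *b)
{
    __m128 va, vb;
    select_by_abs_sse(_mm_loadu_ps(u), _mm_loadu_ps(v), &va, &vb);
    _mm_storeu_ps(a, va);
    _mm_storeu_ps(b, vb);
}
```

Because b is built from the inverted mask rather than a second comparison, a lane containing NaN sends v to a and u to b, which matches the "!mask" wording of the question but not the literal scalar loop.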

find nan in array of doubles using simd

試著忘記壹切 submitted on 2021-02-19 02:18:03
Question: This question is very similar to: SIMD instructions for floating point equality comparison (with NaN == NaN), although that question focused on 128-bit vectors and had requirements about identifying +0 and -0. I had a feeling I might be able to get this one myself, but the Intel Intrinsics Guide page seems to be down :/ My goal is to take an array of doubles and return whether a NaN is present in the array. I expect that most of the time there won't be one, and would like
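One way to approach the stated goal, sketched here under the assumption that SSE2 is the baseline: compare each vector with itself using an "unordered" compare, which is true only for NaN lanes, and OR the masks together so the expected no-NaN case needs a single test after the loop. The function name `has_nan_sse2` is hypothetical and is not taken from the question or its answers.

```c
#include <emmintrin.h>  /* SSE2: __m128d and the _mm_*_pd intrinsics */
#include <stddef.h>

/* Hypothetical helper: returns 1 if any element of x[0..n) is NaN, else 0. */
static int has_nan_sse2(const double *x, size_t n)
{
    size_t i = 0;
    __m128d acc = _mm_setzero_pd();

    for (; i + 2 <= n; i += 2) {
        const __m128d v = _mm_loadu_pd(x + i);
        /* cmpunord(v, v) is all-ones exactly in the lanes where v is NaN. */
        acc = _mm_or_pd(acc, _mm_cmpunord_pd(v, v));
    }

    /* movemask collects the sign bit of each lane; non-zero means a NaN. */
    int found = _mm_movemask_pd(acc) != 0;

    /* Scalar tail for an odd element count: only NaN differs from itself. */
    for (; i < n; ++i)
        found |= (x[i] != x[i]);

    return found;
}
```

This sketch scans the whole array without an early exit, which suits the case where a NaN is rare. It should not be compiled with options such as `-ffast-math` that let the compiler assume NaN never occurs.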

Stack alignment on x86

有些话、适合烂在心里 submitted on 2021-02-18 21:12:17
Question: I had a mysterious bus error that occurred on an x86 (32-bit) platform when running code compiled with gcc-4.8.1 with -march=pentium4. I traced the problem to an SSE instruction: movdqa %xmm5,0x50(%esp) with esp = 0xbfffedac. movdqa requires the address to be 16-byte aligned, which is not the case here, thus the bus error. The problem does not occur if compiling with -march=native (this is a Core-i3 processor). As far as I know, the only stack alignment guaranteed on Linux/x86 is 4-byte.

Ubuntu - how to tell if AVX or SSE is currently being used by a CPU app?

与世无争的帅哥 submitted on 2021-02-16 15:42:45
Question: I currently run BOINC across a number of servers which have GPUs. The servers run both GPU and CPU BOINC apps. As AVX and SSE slow down the CPU frequency when used within a CPU app, I have to be selective about which CPU and GPU apps I run together, as some GPU apps get bottlenecked (slower run time completion) whereas others do not. At present some CPU apps are named so that it is clear whether they use AVX, but most are not. Therefore, is there any command I can run, and some way of viewing, to see if any of

How to implement sign function with SSE3?

一世执手 submitted on 2021-02-16 13:08:38
Question: 1) Is there a way to efficiently implement a sign function using SSE3 (no SSE4) with the following characteristics? The input is a float vector __m128. The output should also be a __m128 with [-1.0f, 0.0f, 1.0f] as its values. I tried this, but it didn't work (though I think it should):

    inputVal = _mm_set_ps(-0.5, 0.5, 0.0, 3.0);
    comp1 = _mm_cmpgt_ps(_mm_setzero_ps(), inputVal);
    comp2 = _mm_cmpgt_ps(inputVal, _mm_setzero_ps());
    comp1 = _mm_castsi128_ps(_mm_castps_si128(comp1));
    comp2 = _mm
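The asker's attempt is truncated, so the sketch below does not diagnose it. It only shows one common way to get the requested [-1.0f, 0.0f, 1.0f] output with nothing newer than SSE: turn the two comparison masks into constants with a bitwise AND, then OR the results. The name `sign_ps` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE: __m128 and the _mm_*_ps intrinsics */

/* Hypothetical helper: per-lane sign as -1.0f, 0.0f or 1.0f. */
static __m128 sign_ps(__m128 x)
{
    const __m128 zero = _mm_setzero_ps();

    /* A comparison mask is all-ones or all-zeros per lane, so AND-ing it
       with a constant yields either that constant or +0.0f. */
    const __m128 pos = _mm_and_ps(_mm_cmpgt_ps(x, zero), _mm_set1_ps(1.0f));
    const __m128 neg = _mm_and_ps(_mm_cmplt_ps(x, zero), _mm_set1_ps(-1.0f));

    /* At most one of pos/neg is non-zero in a lane, so OR merges them. */
    return _mm_or_ps(pos, neg);
}
```

Both comparisons are false for zero and for NaN, so those lanes come out as 0.0f.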

assignment with intel Intrinsics - horizontal add

不羁的心 submitted on 2021-02-11 15:14:19
Question: I want to sum up all elements of a big vector ary. My idea was to do it with a horizontal sum.

    const int simd_width = 16/sizeof(float);
    float helper[simd_width];
    //take the first 4 elements
    const __m128 a4 = _mm_load_ps(ary);
    for(int i=0; i<N-simd_width; i+=simd_width){
        const __m128 b4 = _mm_load_ps(ary+i+simd_width);
        //save temporary result in helper array
        _mm_store_ps(helper, _mm_hadd_ps(a4,b4)); //C
        const __m128 a4 = _mm_load_ps(helper);
    }

I looked for a method with which I can assign the
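The question breaks off before stating exactly what the asker wants to assign, so this is an alternative sketch rather than a fix of the code above. A frequently used pattern keeps a non-const accumulator vector, adds each loaded vector to it lane by lane, and performs a single horizontal reduction after the loop. The name `sum_sse` is hypothetical, and unaligned loads are used so no alignment of `x` is assumed.

```c
#include <xmmintrin.h>  /* SSE: __m128 and the _mm_*_ps intrinsics */
#include <stddef.h>

/* Hypothetical helper: sums x[0..n) using four running partial sums. */
static float sum_sse(const float *x, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));   /* vertical add */

    /* One horizontal reduction at the end: [a b c d] -> a+b+c+d. */
    const __m128 hi = _mm_movehl_ps(acc, acc);        /* [c d c d] */
    const __m128 s2 = _mm_add_ps(acc, hi);            /* [a+c b+d . .] */
    const __m128 s1 =
        _mm_add_ss(s2, _mm_shuffle_ps(s2, s2, _MM_SHUFFLE(1, 1, 1, 1)));
    float total = _mm_cvtss_f32(s1);

    for (; i < n; ++i)                                /* scalar tail */
        total += x[i];

    return total;
}
```

The additions happen in a different order than in a plain scalar loop, so the result can differ from the scalar sum in the last bits for general floating-point data.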

Why does “+=” give me an unexpected result with SSE intrinsics?

五迷三道 submitted on 2021-02-10 11:51:43
Question: There are two ways to implement accumulation with SSE intrinsics, but one of them gives the wrong result.

    #include <smmintrin.h>
    int main(int argc, const char * argv[]) {
        int32_t A[4] = {10, 20, 30, 40};
        int32_t B[8] = {-1, 2, -3, -4, -5, -6, -7, -8};
        int32_t C[4] = {0, 0, 0, 0};
        int32_t D[4] = {0, 0, 0, 0};
        __m128i lv = _mm_load_si128((__m128i *)A);
        __m128i rv = _mm_load_si128((__m128i *)B);
        // way 1 unexpected
        rv += lv;
        _mm_store_si128((__m128i *)C, rv);
        // way 2 expected
        rv = _mm_load
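The snippet ends before any explanation, so the following is an assumption about the cause and not the accepted answer. GCC and Clang define `__m128i` as a vector of two 64-bit integers, so `+=` on that type adds 64-bit lanes and lets a carry cross from one 32-bit element into its neighbour. The sketch below makes the lane width explicit with intrinsics instead of the operator. The helper names are hypothetical, and unaligned loads and stores are used because plain `int32_t` arrays are not guaranteed to be 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2: __m128i and the integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: four independent 32-bit additions. */
static void add_as_epi32(const int32_t *a, const int32_t *b, int32_t *out)
{
    const __m128i va = _mm_loadu_si128((const __m128i *)a);
    const __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

/* Hypothetical helper: the same bits added as two 64-bit lanes, which is
   the assumed behaviour of "+=" on __m128i under GCC and Clang. */
static void add_as_epi64(const int32_t *a, const int32_t *b, int32_t *out)
{
    const __m128i va = _mm_loadu_si128((const __m128i *)a);
    const __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb));
}
```

With the question's first four values, A = {10, 20, 30, 40} and B = {-1, 2, -3, -4}, the 32-bit version yields {9, 22, 27, 36}, while the 64-bit version yields {9, 23, 27, 37} because the carry out of elements 0 and 2 lands in elements 1 and 3.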
