simd

Fast interleave of 2 double arrays into an array of structs with 2 float and 1 int (loop-invariant) members, with SIMD double->float conversion?

試著忘記壹切 · Submitted on 2021-02-20 06:50:27
Question: I have a section of code that is a bottleneck in a C++ application running on x86 processors, where we take double values from two arrays, cast them to float and store them in an array of structs. The reason this is a bottleneck is that it is called either with very large loops, or thousands of times. Is there a faster way to do this copy & cast operation using SIMD intrinsics? I have seen this answer on faster memcpy, but it doesn't address the cast. The simple C++ loop case looks like this: int _iNum; const
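
A minimal sketch of the SIMD double->float part, assuming AVX is available and a hypothetical 12-byte struct layout of {float, float, int}; every name here (Out, interleave_convert, srcA, srcB, dst, n, k) is a placeholder, not taken from the question. The conversion uses _mm256_cvtpd_ps (4 doubles -> 4 floats per call), while the interleaving store stays scalar because a 12-byte stride does not map cleanly onto vector stores.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Assumed output layout: two floats plus one loop-invariant int, 12 bytes per element.
struct Out { float a; float b; int32_t k; };

void interleave_convert(const double* srcA, const double* srcB,
                        Out* dst, std::size_t n, int32_t k)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 fa = _mm256_cvtpd_ps(_mm256_loadu_pd(srcA + i)); // 4 doubles -> 4 floats
        __m128 fb = _mm256_cvtpd_ps(_mm256_loadu_pd(srcB + i));
        float ta[4], tb[4];
        _mm_storeu_ps(ta, fa);
        _mm_storeu_ps(tb, fb);
        for (int j = 0; j < 4; ++j) {
            dst[i + j].a = ta[j];
            dst[i + j].b = tb[j];
            dst[i + j].k = k;                    // loop-invariant member
        }
    }
    for (; i < n; ++i) {                         // scalar tail
        dst[i].a = static_cast<float>(srcA[i]);
        dst[i].b = static_cast<float>(srcB[i]);
        dst[i].k = k;
    }
}
```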

Replacing memcpy with NEON intrinsics

北战南征 · Submitted on 2021-02-20 04:26:28
Question: I am trying to beat the memcpy function by writing NEON intrinsics for the same copy. Below is my logic: uint8_t* m_input; // size 400 x 300 uint8_t* m_output; // size 400 x 300 // not mentioning the complete code base for memory creation memcpy(m_output, m_input, sizeof(m_output[0]) * 300 * 400); NEON: int32_t ht_index, wd_index; uint8x16_t vector8x16_image; for(int32_t htI = 0; htI < m_roiHeight; htI++){ ht_index = htI * m_roiWidth; for(int32_t wdI = 0; wdI < m_roiWidth; wdI += 16){ wd_index = ht
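
For reference, a cleaned-up sketch of the pattern the question's loop is attempting: copy 16 bytes per iteration with a NEON load/store pair. copy_neon and its parameters are placeholder names, and the width is assumed to be a multiple of 16 (true for the 400x300 buffer).

```cpp
#include <arm_neon.h>
#include <cstdint>
#include <cstddef>

void copy_neon(uint8_t* dst, const uint8_t* src, int32_t width, int32_t height)
{
    for (int32_t h = 0; h < height; ++h) {
        const uint8_t* s = src + static_cast<std::size_t>(h) * width;
        uint8_t*       d = dst + static_cast<std::size_t>(h) * width;
        for (int32_t w = 0; w < width; w += 16) {
            uint8x16_t v = vld1q_u8(s + w);  // load 16 bytes
            vst1q_u8(d + w, v);              // store 16 bytes
        }
    }
}
```

In practice the platform memcpy is usually already NEON-optimized and memory-bandwidth bound for buffers of this size, so a hand-written loop like this rarely beats it.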

Find NaN in an array of doubles using SIMD

試著忘記壹切 · Submitted on 2021-02-19 02:18:03
Question: This question is very similar to: SIMD instructions for floating point equality comparison (with NaN == NaN), although that question focused on 128-bit vectors and had requirements about identifying +0 and -0. I had a feeling I might be able to get this one myself, but the Intel intrinsics guide page seems to be down :/ My goal is to take an array of doubles and return whether a NaN is present in the array. I am expecting that the majority of the time there won't be one, and would like
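
One common approach, shown here as a minimal AVX sketch (has_nan and its parameters are placeholder names): a value is NaN iff it compares unordered with itself, so comparing each vector against itself with _CMP_UNORD_Q flags exactly the NaN lanes.

```cpp
#include <immintrin.h>
#include <cstddef>

bool has_nan(const double* p, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d v      = _mm256_loadu_pd(p + i);
        __m256d is_nan = _mm256_cmp_pd(v, v, _CMP_UNORD_Q);
        if (_mm256_movemask_pd(is_nan) != 0)   // any unordered lane -> NaN found
            return true;
    }
    for (; i < n; ++i)                         // scalar tail
        if (p[i] != p[i]) return true;
    return false;
}
```

Since the question expects the no-NaN case to dominate, the early-exit branch is well predicted; the compare results could also be OR-ed together and tested only every few iterations to reduce branching further.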

How to extract 8 integers from a 256-bit vector using Intel intrinsics?

雨燕双飞 · Submitted on 2021-02-19 02:08:35
Question: I'm trying to enhance the performance of my code by using 256-bit vectors (Intel intrinsics - AVX). I have an i7 4th-gen (Haswell architecture) processor supporting SSE1 to SSE4.2 and the AVX/AVX2 extensions. This is the code snippet that I'm trying to enhance: /* code snippet */ kfac1 = kfac + factor; /* 7 cycles for 7 additions */ kfac2 = kfac1 + factor; kfac3 = kfac2 + factor; kfac4 = kfac3 + factor; kfac5 = kfac4 + factor; kfac6 = kfac5 + factor; kfac7 = kfac6 + factor; k1fac1 = k1fac +
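
As a minimal sketch of one way to get the eight 32-bit lanes of a __m256i back into scalar code: store the whole vector once and index it as an ordinary array (names and values below are only illustrative).

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main()
{
    __m256i v = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);

    alignas(32) int32_t out[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(out), v);  // one aligned store
    for (int i = 0; i < 8; ++i)
        std::printf("%d ", out[i]);                          // prints 0 1 2 3 4 5 6 7
    std::printf("\n");
    return 0;
}
```

When only a single lane is needed and its index is a compile-time constant, _mm256_extract_epi32 is an alternative to the store-and-reload.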

Optimizing horizontal boolean reduction in ARM NEON

时光总嘲笑我的痴心妄想 · Submitted on 2021-02-18 10:59:09
Question: I'm experimenting with a cross-platform SIMD library à la ecmascript_simd aka SIMD.js, and part of this is providing a few "horizontal" SIMD operations. In particular, the API that library offers includes any(<boolN x M>) -> bool and all(<boolN x M>) -> bool functions, where <T x K> is a vector of K elements of type T and boolN is an N-bit boolean, i.e. all ones or all zeros, as SSE and NEON return for their comparison operations. For example, let v be a <bool32 x 4> (a 128-bit vector); it
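
On AArch64 these two reductions map directly onto the across-vector max/min intrinsics, sketched below. Each lane of mask is assumed to be all-ones or all-zeros, as NEON comparisons produce; any_u32x4/all_u32x4 are placeholder names for the library's any()/all() on a <bool32 x 4>.

```cpp
#include <arm_neon.h>

inline bool any_u32x4(uint32x4_t mask)
{
    return vmaxvq_u32(mask) != 0;   // horizontal max: nonzero iff some lane is set
}

inline bool all_u32x4(uint32x4_t mask)
{
    return vminvq_u32(mask) != 0;   // horizontal min: nonzero iff every lane is set
}
```

On 32-bit ARM, which lacks vmaxvq/vminvq, the usual substitute is a pairwise vpmax_u32/vpmin_u32 on the two halves of the vector followed by a lane extract.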

IntStream leads to array elements being wrongly set to 0 (JVM Bug, Java 11)

爷,独闯天下 · Submitted on 2021-02-18 03:52:32
Question: In the class P below, the method test seems to return identically false: import java.util.function.IntPredicate; import java.util.stream.IntStream; public class P implements IntPredicate { private final static int SIZE = 33; @Override public boolean test(int seed) { int[] state = new int[SIZE]; state[0] = seed; for (int i = 1; i < SIZE; i++) { state[i] = state[i - 1]; } return seed != state[SIZE - 1]; } public static void main(String[] args) { long count = IntStream.range(0, 0x0010_0000)

How to implement the sign function with SSE3?

一世执手 · Submitted on 2021-02-16 13:08:38
Question: 1) Is there a way to efficiently implement the sign function using SSE3 (no SSE4) with the following characteristics? The input is a float vector __m128. The output should also be a __m128 with [-1.0f, 0.0f, 1.0f] as its values. I tried this, but it didn't work (though I think it should): inputVal = _mm_set_ps(-0.5, 0.5, 0.0, 3.0); comp1 = _mm_cmpgt_ps(_mm_setzero_ps(), inputVal); comp2 = _mm_cmpgt_ps(inputVal, _mm_setzero_ps()); comp1 = _mm_castsi128_ps(_mm_castps_si128(comp1)); comp2 = _mm
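
A minimal sketch of one way to do it (SSE2 is already enough, so the no-SSE4 constraint is satisfied; sign_ps is a placeholder name): comparison results are all-ones masks, i.e. integer -1 in true lanes, so computing (x < 0) - (x > 0) as 32-bit integers yields -1 / 0 / +1 per lane, and a final int->float conversion turns that into -1.0f / 0.0f / 1.0f.

```cpp
#include <emmintrin.h>   // SSE2

static inline __m128 sign_ps(__m128 x)
{
    __m128i gt = _mm_castps_si128(_mm_cmpgt_ps(x, _mm_setzero_ps())); // -1 where x > 0
    __m128i lt = _mm_castps_si128(_mm_cmplt_ps(x, _mm_setzero_ps())); // -1 where x < 0
    return _mm_cvtepi32_ps(_mm_sub_epi32(lt, gt));                    // -1 / 0 / +1 as float
}
```

For the question's example input _mm_set_ps(-0.5, 0.5, 0.0, 3.0), each output lane holds the sign of the corresponding input: 3.0 -> 1.0f, 0.0 -> 0.0f, 0.5 -> 1.0f, -0.5 -> -1.0f. NaN inputs fail both compares and produce 0.0f.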

Summing 8-bit integers in __m512i with AVX intrinsics

喜夏-厌秋 · Submitted on 2021-02-15 07:40:34
Question: AVX-512 provides us with intrinsics to sum all cells in a __m512 vector. However, some of their counterparts are missing: there is no _mm512_reduce_add_epi8, yet. _mm512_reduce_add_ps // horizontal sum of 16 floats _mm512_reduce_add_pd // horizontal sum of 8 doubles _mm512_reduce_add_epi32 // horizontal sum of 16 32-bit integers _mm512_reduce_add_epi64 // horizontal sum of 8 64-bit integers Basically, I need to implement MAGIC in the following snippet: __m512i all_ones = _mm512_set1_epi16(1);
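
One common way to fill in MAGIC for unsigned bytes, sketched under the assumption that AVX-512BW is available (reduce_add_epu8 is a placeholder name): _mm512_sad_epu8 against a zero vector sums each group of 8 bytes into a 64-bit lane, and the existing 64-bit reduction finishes the job. Signed bytes would additionally need the usual bias trick (XOR each byte with 0x80, then subtract 64 * 128 from the total).

```cpp
#include <immintrin.h>
#include <cstdint>

static inline uint64_t reduce_add_epu8(__m512i v)
{
    __m512i partial = _mm512_sad_epu8(v, _mm512_setzero_si512()); // 8 x 64-bit partial sums
    return static_cast<uint64_t>(_mm512_reduce_add_epi64(partial));
}
```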
